Compare commits

..

276 Commits

Author SHA1 Message Date
dependabot[bot]
58dad0212f chore(deps): bump torch in /backend/python/coqui
Bumps torch from 2.4.1 to 2.12.0+xpu.

---
updated-dependencies:
- dependency-name: torch
  dependency-version: 2.12.0+xpu
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-05-19 03:51:06 +00:00
Azteczek
cb502de309 feat: add flake.nix for dockerless setup (#9851)
* Add flake.nix

Signed-off-by: Azteczek <243776410+Azteczek@users.noreply.github.com>

* Add flake.lock

Signed-off-by: Azteczek <243776410+Azteczek@users.noreply.github.com>

---------

Signed-off-by: Azteczek <243776410+Azteczek@users.noreply.github.com>
2026-05-18 15:23:10 +01:00
Richard Palethorpe
5d0b549049 feat(gallery): verify backend OCI images with keyless cosign (#9823)
* feat(gallery): verify backend OCI images with keyless cosign

Close a trust gap where a registry compromise or MITM could silently
replace a backend image: the gallery YAML tells LocalAI which image to
pull, but until now nothing verified the bytes came from our CI.

Consumer (pkg/oci/cosignverify):
- New package using sigstore-go to verify keyless-cosign signatures.
- OCI 1.1 referrers API + new bundle format (no legacy :tag.sig).
- Policy fields: Issuer / IssuerRegex / Identity / IdentityRegex /
  NotBefore. NotBefore is the revocation lever — keyless Fulcio certs
  are ephemeral so revocation is policy-side; advancing not_before in
  the gallery YAML invalidates every signature predating the cutoff.
- TUF trusted root cached process-wide so N backends from one gallery
  do 1 fetch, not N.

Plumbing:
- pkg/downloader: ImageVerifier interface + WithImageVerifier option
  threaded through DownloadFileWithContext. Verification runs between
  oci.GetImage and oci.ExtractOCIImage, with digest pinning via
  pinnedImageRef to close the TOCTOU window. Skips the verifier's HEAD
  when the ref is already digest-pinned.
- core/config: Gallery.Verification YAML block.
- core/gallery: backendDownloadOptions builds the verifier from the
  policy; applied on initial URI, mirrors, and tag fallbacks.
- core/gallery/upgrade: the upgrade path now routes through the same
  options builder. A regression Ginkgo spec pins this contract —
  without it, UpgradeBackend silently bypassed verification.
- core/cli: --require-backend-integrity (LOCALAI_REQUIRE_BACKEND_INTEGRITY)
  escalates missing policy / empty SHA256 from warn to hard-fail.

Producer (.github/workflows/backend_merge.yml):
- id-token: write at job scope (PR-fork-safe via existing event gate).
- sigstore/cosign-installer@v3 pinned to v2.4.1.
- After each docker buildx imagetools create, resolve the manifest
  list digest and run cosign sign --recursive --new-bundle-format
  --registry-referrers-mode=oci-1-1 against repo@digest. --recursive
  signs the index and every per-arch entry, matching how the consumer
  resolves a tag to a platform-specific manifest before verifying.

Rollout: backend/index.yaml has no `verification:` block yet, so this
PR is backward-compatible — installs proceed with a warning until the
gallery is populated. Strict mode is opt-in.

Assisted-by: claude-code:claude-opus-4-7 [Bash] [Edit] [Read] [Write] [WebSearch] [WebFetch]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* refactor(gallery): plumb RequireBackendIntegrity through config instead of env

The previous implementation re-exported the --require-backend-integrity
CLI flag into LOCALAI_REQUIRE_BACKEND_INTEGRITY via os.Setenv, then
re-read it in core/gallery via os.Getenv. This leaked process state
into the gallery package and made the flag impossible to override
per-call or test without touching the env.

Add RequireBackendIntegrity to ApplicationConfig (with a matching
WithRequireBackendIntegrity AppOption) and thread the bool through
every install/upgrade path: InstallBackend, InstallBackendFromGallery,
UpgradeBackend, InstallModelFromGallery, InstallExternalBackend,
ApplyGalleryFromString/File, startup.InstallModels. Worker subcommands
gain the same env-bound flag on WorkerFlags so distributed-worker
installs honor it consistently with the worker daemon path.

Add a forbidigo lint rule against os.Getenv / os.LookupEnv / os.Environ
to keep the env-leak pattern from creeping back. Existing offenders
(p2p, config loaders, etc.) are baseline-grandfathered by the existing
new-from-merge-base: origin/master setting; targeted path exclusions
cover the legitimate cases — kong CLI entry points, backend
subprocesses, system capability probes, gRPC AUTH_TOKEN inheritance,
test gating env vars.

Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-18 08:02:20 +02:00
LocalAI [bot]
11cff1b309 chore: ⬆️ Update ggml-org/llama.cpp to 87589042cac2c390cec8d68fb2fad64e0a2a252a (#9855)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-18 08:01:30 +02:00
LocalAI [bot]
4ca3d2cdc0 docs: ⬆️ update docs version mudler/LocalAI (#9863)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-17 23:20:16 +02:00
LocalAI [bot]
3cba35ed32 chore: ⬆️ Update antirez/ds4 to c9dd9499bfa57c1bbfbb4446eff963330ab5329b (#9864)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-17 23:19:58 +02:00
LocalAI [bot]
265ae35231 chore: ⬆️ Update ikawrakow/ik_llama.cpp to c35189d83c91aad780aba62b89f2830cb2916223 (#9866)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-17 23:19:43 +02:00
LocalAI [bot]
6a48157a80 chore: ⬆️ Update leejet/stable-diffusion.cpp to bd17f53b7386fb5f60e8587b75e73c4b2fed3426 (#9854)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-16 23:12:05 +02:00
LocalAI [bot]
41c838b2df chore: ⬆️ Update ikawrakow/ik_llama.cpp to 3e573cfea6e0a332eff822ffbdb1dd3b112e9051 (#9856)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-16 22:44:08 +02:00
LocalAI [bot]
21e793ad2a chore: ⬆️ Update antirez/ds4 to ef0a4905d05263df8e63689f2dd1efac618a752c (#9857)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-16 22:43:46 +02:00
LocalAI [bot]
7c190bb4b9 docs: ⬆️ update docs version mudler/LocalAI (#9853)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-16 22:43:06 +02:00
LocalAI [bot]
d77a9137d8 feat(llama-cpp): bump to MTP-merge SHA and automatically set MTP defaults (#9852)
* feat(llama-cpp): bump to MTP-merge SHA and document draft-mtp spec type

Update LLAMA_VERSION to 0253fb21 (post ggml-org/llama.cpp#22673 merge,
2026-05-16) to pick up Multi-Token Prediction support.

No grpc-server.cpp changes are required: the existing `spec_type` option
delegates to upstream's `common_speculative_types_from_names()`, which
already accepts the new `draft-mtp` name. The `n_rs_seq` cparam needed
by MTP is auto-derived inside `common_context_params_to_llama` from
`params.speculative.need_n_rs_seq()`, and when no `draft_model` is set
the upstream server builds the MTP context off the target model itself.

Docs: extend the speculative-decoding section of the model-configuration
guide with the new type, both load paths (MTP head embedded in the main
GGUF vs. separate `mtp-*.gguf` sibling), the PR's recommended
`spec_n_max:2-3`, and the chained `draft-mtp,ngram-mod` recipe. Also
notes that the upstream `-hf` auto-discovery of `mtp-*.gguf` siblings is
not wired through LocalAI's gRPC layer.

Agent guide: short note explaining that new upstream spec types are
picked up automatically and that MTP needs no gRPC plumbing.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): auto-detect MTP heads and enable draft-mtp on import + load

Detect upstream's `<arch>.nextn_predict_layers` GGUF metadata key (set by
`convert_hf_to_gguf.py` for Qwen3.5/3.6 family models and similar) and,
when present and the user has not configured a `spec_type` explicitly,
auto-append the upstream-recommended speculative-decoding tuple:

  - spec_type:draft-mtp
  - spec_n_max:6
  - spec_p_min:0.75

The 0.75 p_min is pinned defensively because upstream marks the current
default with a "change to 0.0f" TODO; locking it here keeps acceptance
thresholds stable across future llama.cpp bumps.

Detection runs in two places:

  - The model importer (`POST /models/import-uri`, the `/import-model`
    UI) range-fetches the GGUF header for HuggingFace / direct-URL
    imports via `gguf.ParseGGUFFileRemote`, with a 30s timeout and
    non-fatal error handling. OCI/Ollama URIs are skipped because the
    artifact is not directly streamable; the load-time hook covers them
    once the file is on disk.
  - The llama-cpp load-time hook (`guessGGUFFromFile`) reads the local
    header on every model start and appends the same options if
    `spec_type` is not already set.

Both paths share `ApplyMTPDefaults` and respect an explicit user-set
`spec_type:` / `speculative_type:` so YAML overrides win. Ginkgo
specs cover the append, preserve-user-choice, legacy alias, and nil
safety paths.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(importer): resolve huggingface:// URIs before MTP header probe

`gguf.ParseGGUFFileRemote` only speaks HTTP(S), but the importer was
handing it the raw `huggingface://...` URI directly (and similarly for
any other custom downloader scheme). Live-test against
`huggingface://ggml-org/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-MTP-Q8_0.gguf`
exposed this: the probe failed with `unsupported protocol scheme
"huggingface"`, was caught by the non-fatal error path, and the MTP
options were silently never applied to the generated YAML.

Route every candidate URI through `downloader.URI.ResolveURL()` and
require the resolved form to be HTTP(S). After the fix the probe
successfully reads `<arch>.nextn_predict_layers=1` from the real HF
GGUF and the emitted ConfigFile carries spec_type:draft-mtp,
spec_n_max:6, spec_p_min:0.75 as intended.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-16 22:42:48 +02:00
LocalAI [bot]
661a0c3b9d fix(ollama): accept float-encoded integer options (fixes #9837) (#9849)
fix(ollama): accept float-encoded integer options (num_ctx, top_k, ...)

Home Assistant's Ollama integration encodes integer options as JSON
floats (e.g. `"num_ctx": 8192.0`). Stdlib `json.Unmarshal` refuses to
decode a number with fractional notation into an `int` field, so the
entire request was rejected with HTTP 400 before reaching the backend:

  Unmarshal type error: expected=int, got=number 8192.0,
  field=options.num_ctx

Add a custom `UnmarshalJSON` on `OllamaOptions` that routes the int
fields (`top_k`, `num_predict`, `seed`, `repeat_last_n`, `num_ctx`)
through `*json.Number`, then converts via `Int64()` with a `Float64()`
fallback. Public field types are unchanged, so endpoint code is
untouched. Float fields and `stop` continue to parse via the default
path.

Fixes #9837

Assisted-by: Claude Code:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-16 18:38:19 +02:00
LocalAI [bot]
00b8989886 chore: ⬆️ Update ggml-org/llama.cpp to 1348f67c58f561808136e8a152a9eddec168f221 (#9842)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-16 08:41:09 +02:00
LocalAI [bot]
43e0d397ca chore: ⬆️ Update ggml-org/whisper.cpp to 968eebe77225d25e57a3f981da7c696310f0e881 (#9843)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-16 00:30:04 +02:00
LocalAI [bot]
a1a7a219ed chore: ⬆️ Update antirez/ds4 to 950e8e6474a1c9fabe04e669d607606a7ef8824f (#9844)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 23:46:29 +02:00
LocalAI [bot]
3937ec6527 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 5cc0d86c760e9858e4bed4418400bb39dbe025f2 (#9845)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 23:45:54 +02:00
LocalAI [bot]
1355b55794 chore: ⬆️ Update vllm-project/vllm cu130 wheel to 0.21.0 (#9846)
⬆️ Update vllm-project/vllm cu130 wheel

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 23:45:41 +02:00
Richard Palethorpe
5a2626d465 fix(deps): bump gomarkdown/markdown for GHSA-77fj-vx54-gvh7 (#9841)
Out-of-bounds read in SmartypantsRenderer.smartLeftAngle (CWE-125,
CVSS 7.5). Reachable transitively via LocalAGI's Email connector,
which renders inbound HTML email replies using html.CommonFlags
(includes Smartypants). An unmatched `<` in the inbound body could
panic the agent service.

Bump to v0.0.0-20260411013819-759bbc3e3207 (contains the fix). The
klauspost/compress entry loses its `// indirect` tag because
go mod tidy noticed pkg/utils/untar.go imports it directly.

Assisted-by: Claude:claude-opus-4-7 [Claude-Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-15 21:48:59 +02:00
LocalAI [bot]
a39591f144 realtime: honor output_modalities to skip TTS in text-only mode (#9838)
* realtime: honor output_modalities to skip TTS in text-only mode

The emulated realtime pipeline previously ignored the OpenAI Realtime spec
field output_modalities and always synthesized TTS. Add resolveOutputModalities
+ modalitiesContainAudio helpers and gate the TTS / ResponseOutputAudio*
emission so a client requesting ["text"] gets only ResponseOutputText* events.

This lets thin clients (e.g. thing5-poc) cache TTS on the client side while
still using the realtime WS for VAD + STT + LLM + tool-call parsing.

Assisted-by: Claude:claude-opus-4-7

* realtime: plumb response-level output_modalities and echo on session

Follow-up to the previous commit:
- Resolve response.create's output_modalities at the gate so a per-response
  override of an audio session is honored (the test asserted this contract
  but the production call site was passing nil).
- Mirror OutputModalities in the RealtimeSession echo so session.update
  round-trips the client-supplied value, matching MaxOutputTokens's pattern.

Assisted-by: Claude:claude-opus-4-7

* realtime: silence errcheck on deferred os.Remove of TTS file

CI's errcheck flagged the pre-existing `defer os.Remove(audioFilePath)`
inside the audio-emission block (now wrapped by the modality gate). Wrap
the call in a closure that explicitly discards the error — the canonical
Go pattern for "I want to defer a cleanup whose error I genuinely don't
care about."

Assisted-by: Claude:claude-opus-4-7 golangci-lint

---------

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-15 12:39:47 +02:00
massy_o
8c785dbe4a Validate archive member paths before extraction (#9820)
Signed-off-by: massy-o <telitos000@gmail.com>
2026-05-15 11:12:13 +02:00
LocalAI [bot]
4abf5befbb chore: ⬆️ Update ggml-org/llama.cpp to 834a243664114487f99520370a7a7b00fc7a486f (#9826)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 10:29:22 +02:00
LocalAI [bot]
195b910260 chore: ⬆️ Update leejet/stable-diffusion.cpp to 0b8296915c4094090cff6bd2e09a5e98288c3c7d (#9827)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 10:19:52 +02:00
LocalAI [bot]
ba21bf667c docs: ⬆️ update docs version mudler/LocalAI (#9825)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 10:19:34 +02:00
LocalAI [bot]
7bd1693ad0 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 0fcffdb64d21e57f0778f342415754156e01adfa (#9828)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 10:08:46 +02:00
LocalAI [bot]
b5ac3a7373 chore: ⬆️ Update ggml-org/whisper.cpp to 46ca43d6399fdeada1b49fb2126ba373bd9ebc38 (#9829)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 10:08:24 +02:00
LocalAI [bot]
53de474ef5 chore: ⬆️ Update antirez/ds4 to 04b6fda2be395094cbf2d20d921e7a705a4166ef (#9830)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 10:08:09 +02:00
LocalAI [bot]
c33d36b870 fix(ollama): guard nil filter in galleryop.ListModels (#9817) (#9836)
The Ollama /api/tags handler passes a nil filter to galleryop.ListModels.
When ModelsPath contains any non-skipped loose file the function then
calls filter(name, nil) and panics, which Echo surfaces to clients as
"Server disconnected without sending a response" - the exact failure
Home Assistant's Ollama integration reports against LocalAI.

Mirror the nil guard already present in
ModelConfigLoader.GetModelConfigsByFilter so every caller is safe, and
add a regression test that exercises the loose-file path with a nil
filter.

Assisted-by: claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-15 10:07:50 +02:00
LocalAI [bot]
57fa178a64 feat(swagger): update swagger (#9824)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-15 09:30:29 +02:00
massy_o
745473cbe6 Validate video image URLs before download (#9819)
Signed-off-by: massy-o <telitos000@gmail.com>
2026-05-14 15:07:17 +02:00
massy_o
594c9fd92e Close Hugging Face scan response body (#9818)
Signed-off-by: massy-o <telitos000@gmail.com>
2026-05-14 12:35:29 +02:00
LocalAI [bot]
8af963bdd9 fix(streaming): comply with OpenAI usage / stream_options spec (#9815)
* fix(streaming): comply with OpenAI usage / stream_options spec (#8546)

LocalAI emitted `"usage":{"prompt_tokens":0,...}` on every streamed
chunk because `OpenAIResponse.Usage` was a value type without
`omitempty`. The official OpenAI Node SDK and its consumers
(continuedev/continue, Kilo Code, Roo Code, Zed, IntelliJ Continue)
filter on a truthy `result.usage` to detect the trailing usage chunk;
LocalAI's zero-but-non-null usage on every intermediate chunk made
that filter swallow every content chunk and surface an empty chat
response while the server log looked successful.

Changes:

- `core/schema/openai.go`: `Usage *OpenAIUsage \`json:"usage,omitempty"\``
  so intermediate chunks no longer carry a `usage` key. Add
  `OpenAIRequest.StreamOptions` with `include_usage` to mirror OpenAI's
  request field.
- `core/http/endpoints/openai/chat.go` and `completion.go`: keep using
  the `Usage` struct field as an in-process channel for the running
  cumulative, but strip it before JSON marshalling. When the request
  set `stream_options.include_usage: true`, emit a dedicated trailing
  chunk with `"choices": []` and the populated usage (matching the
  OpenAI spec and llama.cpp's server behavior).
- `chat_emit.go`: new `streamUsageTrailerJSON` helper; drop the
  `usage` parameter from `buildNoActionFinalChunks` since chunks no
  longer carry usage.
- Update `image.go`, `inpainting.go`, `edit.go` to wrap their Usage
  values with `&` for the new pointer field.
- UI: send `stream_options:{include_usage:true}` from the React
  (`useChat.js`) and legacy (`static/chat.js`) chat clients so the
  token-count badge keeps populating now that the server is
  spec-compliant.

Tests:

- New `chat_stream_usage_test.go` pins the spec invariants:
  intermediate chunks have no `usage` key, the trailer JSON has
  `"choices":[]` and a populated `usage`, and `OpenAIRequest` parses
  `stream_options.include_usage`.
- Update `chat_emit_test.go` to reflect that finals no longer embed
  usage.

Verified against the live LocalAI instance: before the fix Continue's
filter logic swallowed 16/16 token chunks; with the new shape it
yields 4/5 and routes usage through the dedicated trailer chunk.

Fixes #8546

Assisted-by: Claude:opus-4.7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(streaming): silence errcheck on usage trailer Fprintf

The new spec-compliant `stream_options.include_usage` trailer writes
were flagged by errcheck since they're new code (golangci-lint runs
new-from-merge-base on master); the surrounding `fmt.Fprintf` data:
writes are grandfathered. Drop the return values explicitly to match
the linter's contract without adding a nolint shim.

Assisted-by: Claude:opus-4.7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-14 08:53:46 +02:00
LocalAI [bot]
6e1dbae256 feat(llama-cpp): expose 12 missing common_params via options[] (#9814)
The llama.cpp backend already accepts a free-form options: array in the
model config that maps to common_params fields, but a coverage audit
against upstream pin 7f3f843c flagged 12 user-visible knobs that were
neither set via the typed proto fields nor reachable via options:.

Wire them up under the existing if/else chain in params_parse, before
the speculative section. Each new option follows the file's prevailing
patterns (try/catch around numeric parses, the same true/1/yes/on bool
form used elsewhere, hardware_concurrency() fallback for thread counts,
mirror of draft_override_tensor for override_tensor).

Top-level / batching / IO:
  - n_ubatch (alias ubatch) -- physical batch size; was previously
    force-aliased to n_batch at line 482, blocking embedding/rerank
    workloads that need independent control
  - threads_batch (alias n_threads_batch) -- main-model batch threads;
    mirrors the existing draft_threads_batch
  - direct_io (alias use_direct_io) -- O_DIRECT model loads
  - verbosity -- llama.cpp log threshold (line 479 had this commented
    out)
  - override_tensor (alias tensor_buft_overrides) -- per-tensor buffer
    overrides for the main model; mirrors draft_override_tensor

Embedding / multimodal:
  - pooling_type (alias pooling) -- mean/cls/last/rank/none; previously
    only auto-flipped to RANK for rerankers
  - embd_normalize (alias embedding_normalize) -- and the embedding
    handler now reads params_base.embd_normalize instead of a hardcoded
    2 at the previous embd_normalize literal in Embedding()
  - mmproj_use_gpu (alias mmproj_offload) -- mmproj on CPU vs GPU
  - image_min_tokens / image_max_tokens -- per-image vision token budget

Reasoning surface (the audit-focus three; LocalAI's existing
ReasoningConfig.DisableReasoning only feeds the per-request
chat_template_kwargs.enable_thinking and does not touch any of these):
  - reasoning_format -- none/auto/deepseek/deepseek-legacy parser
  - enable_reasoning (alias reasoning_budget) -- -1/0/>0 thinking budget
  - prefill_assistant -- trailing-assistant-message prefill toggle

All 14 referenced fields exist on both the upstream pin and the
turboquant fork's common.h, so no LOCALAI_LEGACY_LLAMA_CPP_SPEC guard
is needed.

Docs: extend model-configuration.md with new "Reasoning Models",
"Multimodal Backend Options", "Embedding & Reranking Backend Options",
and "Other Backend Tuning Options" subsections; also refresh the
Speculative Type Values table to show the new dash-separated canonical
names alongside the underscore aliases LocalAI still accepts.


Assisted-by: claude-code:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-14 08:53:34 +02:00
LocalAI [bot]
53bdb18d10 chore: ⬆️ Update ggml-org/llama.cpp to 7f3f843c31cd32dc4adc10b393342dfee071c332 (#9809)
* ⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(llama-cpp): adapt to upstream COMMON_SPECULATIVE_TYPE_DRAFT rename

ggml-org/llama.cpp#22964 ("spec: update CLI arguments for better
consistency") renamed the speculative type enum values:
  COMMON_SPECULATIVE_TYPE_DRAFT  -> COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE
  COMMON_SPECULATIVE_TYPE_EAGLE3 -> COMMON_SPECULATIVE_TYPE_DRAFT_EAGLE3
and the registered name strings flipped from underscore- to dash-
separated form (e.g. ngram_simple -> ngram-simple), with the bare
draft/eagle3 aliases replaced by draft-simple/draft-eagle3.

This broke the build with the new LLAMA_VERSION on every variant
(vulkan/arm64, darwin and likely all the rest) at grpc-server.cpp:461.

Update the upstream branch of the speculative-type fallback to use the
new identifier (the LOCALAI_LEGACY_LLAMA_CPP_SPEC fork branch keeps the
old name), and normalize spec_type option tokens before passing them to
common_speculative_types_from_names so existing model configs that say
spec_type:draft / spec_type:ngram_simple keep working.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-14 08:53:23 +02:00
LocalAI [bot]
42a8db3573 ci(image): publish missing :latest-* and :v<X>-* singleton image tags (#9812)
* ci(image): wire singleton merges + `--` artifact separator

Closes the same singletons gap on the LocalAI server image workflow that
PR #9781 closed for backends. The user observed it as missing
:latest-gpu-nvidia-cuda-12 etc. on quay.io/go-skynet/local-ai — the
build matrix has six single-arch entries with no corresponding merge
step, so their per-arch digests push (push-by-digest=true) and never
get tagged:

  - -gpu-hipblas              (hipblas-jobs)
  - -gpu-nvidia-cuda-12       (core-image-build)
  - -gpu-nvidia-cuda-13       (core-image-build)
  - -gpu-intel                (core-image-build)
  - -nvidia-l4t-arm64         (gh-runner)
  - -nvidia-l4t-arm64-cuda-13 (gh-runner)

Only :latest, :v<X>, :latest-gpu-vulkan and :v<X>-gpu-vulkan were
actually being published before this commit (the two multiarch suffixes
that had merge jobs).

Changes:

1. image.yml: add six new merge jobs, one per single-arch entry. Each
   `needs:` only its parent build job (matching the existing pattern
   for core-image-merge / gpu-vulkan-image-merge).
2. image_build.yml: switch artifact name to
   `digests-localai<suffix>--<platform-tag-or-"single">`. The `--`
   separator anchors the merge-side glob so a singleton tag-suffix
   doesn't over-match a longer suffix that shares its prefix
   (-nvidia-l4t-arm64 vs -nvidia-l4t-arm64-cuda-13). Same convention
   as backend_build.yml's fix.
3. image_merge.yml: update the download pattern to match.

Next master push or tag release should produce :latest-gpu-hipblas,
:latest-gpu-nvidia-cuda-12, :latest-gpu-nvidia-cuda-13, :latest-gpu-intel,
:latest-nvidia-l4t-arm64, :latest-nvidia-l4t-arm64-cuda-13 (and their
:v<X>-* equivalents) for the first time on the post-#9781 workflow.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(image): add !cancelled() guard to all 8 image merge jobs

Parity pass with backend.yml's merge jobs (8521af14). Without
!cancelled(), GHA's default `needs:` cascade skips the merge when ANY
matrix cell of the parent build job fails or is cancelled — so a single
flaky leg would suppress publication of every other tag-suffix's
manifest list. Same fix the backend got after v4.2.1 showed 2 failed
singlearch builds cascade-skip 199 singlearch merge entries.

Applied to all 8 image merges:

  - core-image-merge
  - gpu-vulkan-image-merge
  - gpu-nvidia-cuda-12-image-merge       (added in e5300f1a)
  - gpu-nvidia-cuda-13-image-merge       (added in e5300f1a)
  - gpu-intel-image-merge                (added in e5300f1a)
  - gpu-hipblas-image-merge              (added in e5300f1a)
  - nvidia-l4t-arm64-image-merge         (added in e5300f1a)
  - nvidia-l4t-arm64-cuda-13-image-merge (added in e5300f1a)

Build jobs (hipblas-jobs, core-image-build, gh-runner) are
intentionally NOT changed — they have no upstream `needs:` to cascade-
skip from.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-14 00:28:48 +02:00
LocalAI [bot]
0353d3bd77 chore: ⬆️ Update ggml-org/whisper.cpp to 3e9b7d0fef3528ee2208da3cdb873a2c53d2ae2f (#9808)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-14 00:20:14 +02:00
LocalAI [bot]
ec49995190 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 949bb8f1d660fc1264c137a6f3dbd619375f6134 (#9807)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-14 00:15:32 +02:00
Tai An
67c34bbb96 fix(middleware): parse OpenAI-spec tool_choice in /v1/chat/completions (#9559)
* fix(middleware): parse OpenAI-spec tool_choice in /v1/chat/completions

Follows up on #9526 (the 3-site setter fix) by addressing the remaining
clause in #9508 — string mode and OpenAI-spec specific-function shape both
silently failed in the /v1/chat/completions parsing path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(middleware): restore LF endings and cover tool_choice parsing with specs

The previous commit on this branch saved core/http/middleware/request.go
with CRLF line endings, ballooning the diff against master to 684 / 651
for what is in reality a ~50-line parsing change. Restore LF (matches
.editorconfig end_of_line = lf).

Add 11 Ginkgo specs under "SetModelAndConfig tool_choice parsing
(chat completions)" that parallel the existing MergeOpenResponsesConfig
specs from #9509. They drive the full middleware chain (SetModelAndConfig
+ SetOpenAIRequest) and assert:

  * "required"  -> ShouldUseFunctions=true, no specific name
  * "none"      -> ShouldUseFunctions=false (tools disabled per OpenAI spec)
  * "auto"      -> default, tools available, no specific name
  * {type:function, function:{name:X}}  (spec)    -> X is forced
  * {type:function, name:X}             (legacy)  -> X is forced
  * nested wins when both forms are present
  * malformed shapes (no type, wrong type, no name, empty name) are no-ops

Update the inline comment on the string case to describe the actual
mechanism: "none" reaches SetFunctionCallString("none") downstream and
is then honored by ShouldUseFunctions() returning false. Before this PR
json.Unmarshal([]byte("none"), &functions.Tool{}) failed silently, so
"none" was ignored - making "none" actually work is a real behavior fix
this PR brings.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4-7 [Claude Code]

* fix(middleware): preserve pre-#9559 support for JSON-string-encoded tool_choice

Some non-spec clients send tool_choice as a JSON-encoded string of an
object form, e.g. "{\"type\":\"function\",\"function\":{\"name\":\"X\"}}".
The pre-#9559 code accepted this by accident: its case string: branch
ran json.Unmarshal([]byte(content), &functions.Tool{}), which succeeded
for that double-encoded shape even though it failed for the legitimate
plain string modes "auto" / "none" / "required".

The first version of this PR routed every string straight to
SetFunctionCallString as a mode, which fixed the plain-string cases but
silently regressed the double-encoded one (funcs.Select("{...}") returns
nothing). Restore the fallback: when a string looks like a JSON object,
try parsing it as a tool_choice map first; fall through to mode-string
handling only when no usable name comes out.

Factor the map-name extraction into a small helper
(extractToolChoiceFunctionName) so the string-fallback and the regular
map case go through identical code, and accept both the OpenAI-spec
nested shape and the legacy/Anthropic flat shape from either entry
point.

Add 3 Ginkgo specs covering the double-encoded case (nested form, legacy
form, and the fall-through when the JSON has no usable name).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4-7 [Claude Code]

* test(middleware): silence errcheck on AfterEach os.RemoveAll

The new tool_choice parsing tests added a second AfterEach that calls
os.RemoveAll(modelDir) without checking the error; errcheck flagged it.
Suppress with the standard _ = idiom. The pre-existing AfterEach on the
earlier Describe still elides the check the same way it did before -
leaving that untouched to keep this commit minimal.

Assisted-by: Claude:opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-14 00:14:38 +02:00
LocalAI [bot]
4430fae779 chore: ⬆️ Update antirez/ds4 to 0cba357ca1bc0e7510421cc26888e420ea942123 (#9806)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-14 00:14:23 +02:00
LocalAI [bot]
ab01ed1a3e fix(agentpool): close truncate-then-read race in agent_jobs.json persistence (#9811)
* fix(agentpool): close truncate-then-read race in agent_jobs.json persistence

Three call sites wrote and read agent_jobs.json (and agent_tasks.json)
through three independent mutexes:

  - AgentJobService.ExecuteJob spawns go saveJobs(job) -> fileJobPersister
    holding p.mu
  - AgentJobService.SaveJobsToFile holding service.fileMutex
  - AgentJobService.LoadJobsFromFile on a separate service instance holding
    a different service.fileMutex

Nothing serialized those mutexes, and both writers used os.WriteFile, which
opens O_TRUNC. A reader landing between the truncate and the write saw a
zero-byte file and surfaced as `unexpected end of JSON input` at offset 0.
The macOS tests-apple job started hitting this consistently once the path
filter was removed from .github/workflows/test.yml and the file-mode race
test ran on every push (run 25823124797 was the first observed failure).

Two changes close the window:

1. fileJobPersister.saveTasksToFile / saveJobsToFile now write to a
   same-directory temp file and os.Rename to the final path. rename(2) is
   atomic on POSIX, so concurrent readers see either the prior contents or
   the new contents and never a zero-byte window. The helper Syncs before
   close so a crash mid-write leaves either the old file intact or the temp
   behind (cleaned up on next save).

2. AgentJobService.{Load,Save}{Tasks,Jobs}{FromFile,ToFile} are collapsed
   to thin wrappers around fileJobPersister, removing the duplicate write
   path and the redundant service.fileMutex / service.tasksFile /
   service.jobsFile fields. Within a single service all task/job I/O now
   serializes on the persister's mutex; the atomic rename handles the
   cross-instance case the tests exercise.

Adds a regression test that hammers SaveJobsToFile and LoadJobsFromFile
concurrently for 500ms across two service instances on the same paths.
On master this reproduces `unexpected end of JSON input` on Linux within
~500ms; with the fix the suite ran -until-it-fails for 30s (54 attempts,
all green).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(agentpool): route service flush/load through JobPersister interface

The first cut of the race fix made AgentJobService.{Save,Load}{Tasks,Jobs}*
type-assert s.persister to *fileJobPersister so they could reach the
unexported saveTasksToFile / saveJobsToFile helpers. That defeats the
JobPersister interface: the service is back to reasoning about a concrete
implementation instead of an abstraction.

Promote the bulk-flush operations to the interface as FlushTasks / FlushJobs:

  - fileJobPersister.FlushTasks/FlushJobs call the existing private helpers
    (atomic temp+rename writes from the prior commit).
  - dbJobPersister.FlushTasks/FlushJobs are no-ops because SaveTask/SaveJob
    are already write-through to the database.

The service's four file-named methods now talk only to the interface:
LoadTasks/LoadJobs read through s.persister.LoadTasks/LoadJobs, and the
Save side calls FlushTasks/FlushJobs. The "FromFile"/"ToFile" suffixes
stay for backward compat with user_services.go and the existing tests,
but they no longer claim a file-only contract.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-13 23:58:43 +02:00
Ettore Di Giacinto
6bfe7f8c05 ci(image-merge): apply the keepalive+ci-cache source fix to image_merge
Mirror of 8521af14 (which fixed backend_merge.yml) for image_merge.yml.
Today's master-push run 25823024353 failed the gpu-vulkan-image-merge job
with the exact same error pattern the backend merge had on v4.2.2:

  ERROR: quay.io/go-skynet/local-ai@sha256:68b22611...: not found

Same root cause: image_build.yml pushes the per-arch manifest to
quay.io/go-skynet/local-ai with push-by-digest=true (no tag), then the
merge runs minutes-to-hours later, by which time quay's per-repo manifest
GC has reaped the untagged digest from local-ai. The blob still lives in
quay's storage but local-ai@<digest> no longer resolves.

Three matching edits:

1. image_build.yml: anchor each per-arch digest into ci-cache immediately
   after the push, reusing .github/scripts/anchor-digest-in-cache.sh with
   SOURCE_IMAGE=quay.io/go-skynet/local-ai and TAG_SUFFIX defaulting to
   "-core" for the core image (matches the artifact-name convention).
2. image_merge.yml: change the quay merge source from local-ai@<digest>
   to ci-cache@<digest>. Same correctness argument as backend_merge.yml —
   the manifest content is alive in ci-cache; buildx imagetools create
   republishes it into local-ai and writes the user-facing manifest list
   pointing at it. End state in local-ai is self-contained.
3. image_merge.yml: add a sparse `actions/checkout@v6` (only
   .github/scripts) so cleanup-keepalive-tags.sh is available, plus the
   cleanup step itself with TAG_SUFFIX matching the anchor's "-core"
   placeholder.

v4.2.3's image.yml run completed successfully (~50 min between push and
merge — beat quay's GC). This commit closes the race for future releases
and master pushes regardless of run length.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-13 21:37:03 +00:00
LocalAI [bot]
5a42dbf3ec docs: ⬆️ update docs version mudler/LocalAI (#9805)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 22:42:54 +02:00
Adira
c2fe0a6475 fix(http): honor X-Forwarded-Prefix when proxy strips the prefix (#9614)
* fix(http): honor X-Forwarded-Prefix when proxy strips the prefix

Closes #9145.

Two related issues kept the React UI from loading when a reverse proxy
rewrites a sub-path with prefix-stripping (e.g. Caddy `handle_path`):

1. `BaseURL` only computed a prefix from the path StripPathPrefix had
   removed, so when the proxy strips the prefix before forwarding, the
   request arrives without it and the base URL was returned without a
   prefix. Extract a `BasePathPrefix` helper and add an
   `X-Forwarded-Prefix` header fallback so the prefix is recovered.
2. `<base href>` only changes how relative URLs resolve; the build
   emits path-absolute references like `/assets/...` and
   `/favicon.svg`, which still resolve against the origin and bypass
   the proxy prefix. Rewrite those references in the served
   `index.html` so the browser requests them through the proxy.

Adds unit coverage for `BaseURL` with a pre-stripped path and an
end-to-end test for the proxy-stripped scenario.

Assisted-by: Claude:claude-opus-4-7

* fix(http): gate X-Forwarded-Prefix through SafeForwardedPrefix in BasePathPrefix

BasePathPrefix consumed X-Forwarded-Prefix directly, so a value the
codebase elsewhere rejects (e.g. "//evil.com") slipped through and was
interpolated into the SPA index.html — both into the path-absolute asset
URL rewrite in serveIndex (turning "/assets/..." into "//evil.com/assets/...",
a protocol-relative URL that loads JS from a foreign origin) and into
<base href>. Route the header through the existing SafeForwardedPrefix
validator that StripPathPrefix and prefixRedirect already use, and
HTML-escape the prefix before injecting it into the asset rewrite as
defense in depth against attribute breakout.

Tests cover //evil.com, backslashes, control chars, CR/LF and a missing
leading slash; the integration test asserts an unsafe prefix can't poison
asset URLs.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7-1m [Read] [Edit] [Bash]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-13 21:59:33 +02:00
LocalAI [bot]
ddbbdf45b9 chore: ⬆️ Update TheTom/llama-cpp-turboquant to 5aeb2fdbe26cd4c534c6fa15de73cb5749bd0403 (#9740)
⬆️ Update TheTom/llama-cpp-turboquant

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 21:58:33 +02:00
LocalAI [bot]
b4fdb41dcc fix(distributed): cascade-clean stale node_models rows + filter routing by healthy status (#9754)
* fix(distributed): cascade-clean stale node_models on drain and filter routing by healthy status

Stale node_models rows (state="loaded") were surviving past the healthy
state of their owning node, causing /embeddings (and other inference
paths) to dispatch to a backend whose process was gone or drained. The
downstream symptom in a live cluster was pgvector rejecting inserts
with "vector cannot have more than 16000 dimensions (SQLSTATE 54000)"
because the misbehaving backend silently returned a malformed
(oversized) tensor; the Models page showed the model as "running"
without an associated node, like a stale entry, even though the node
was no longer visible in the Nodes view.

Two changes here, plus a third in a follow-up commit:

- MarkDraining now cascade-deletes node_models rows for the affected
  node, mirroring MarkOffline. Drains are explicit operator actions —
  the box has been intentionally taken out of rotation — so clearing
  the rows stops the Models UI from misreporting and prevents the
  routing layer from picking those rows if scheduling logic is ever
  relaxed. In-flight requests already hold their gRPC client through
  Route() and finish normally; the only observable effect is a
  non-fatal IncrementInFlight warning, acceptable for a drain.

  MarkUnhealthy is deliberately left status-only: it fires from
  managers_distributed / reconciler on a single nats.ErrNoResponders
  with no retry, so a transient NATS hiccup must not nuke every loaded
  model and force a full reload on recovery.

- FindAndLockNodeWithModel's inner JOIN now filters on
  backend_nodes.status = healthy in addition to node_models.state =
  loaded. The previous version relied on the second node-fetch step to
  reject non-healthy nodes, but a concurrent reader could still pick
  the same stale row in the same window. Belt-and-braces.

- DistributedConfig.PerModelHealthCheck renamed to
  DisablePerModelHealthCheck and inverted at the call site so
  per-model gRPC probing is on by default. The probe (now made
  consecutive-miss aware in a follow-up commit) independently health-
  checks each model's gRPC address and removes stale node_models rows
  when the backend has crashed even though the worker's node-level
  heartbeat is still arriving.

  Migration: the field had no CLI flag, env var binding, or YAML key
  in tree (only the bare struct field), so there is no user-facing
  migration. Anything constructing DistributedConfig in code needs to
  drop the assignment (default now does the right thing) or invert it.

Assisted-by: Claude:claude-opus-4-7 go-vet go-test golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(distributed): require consecutive misses before per-model probe removes a row

The per-model gRPC probe used to remove a node_models row on a single
failed health check. With the per-model probe now on by default, that
made any 5-second gRPC blip (network jitter, a long-running request
hogging the worker's gRPC server thread, brief GC pause) trigger a
full reload of the affected model — too eager for production.

Require perModelMissThreshold (3) consecutive failed probes before
removal. At the default 15s tick a model must be unreachable for ~45s
before reap; a single successful probe in between resets the streak.
Per-(node, model, replica) state tracked under a mutex on the monitor.

If the removal call itself fails, the miss counter is left in place
so the next tick retries rather than starting the streak over.

Tests:
- removes stale model via per-model health check after consecutive
  failures (replaces the single-shot expectation)
- preserves model row when an intermittent failure is followed by a
  success (covers the reset-on-success path and verifies the counter
  reset by failing twice more without crossing threshold)
- newTestHealthMonitor initializes the misses map so direct-construct
  test helpers don't nil-map-panic in the probe path

Assisted-by: Claude:claude-opus-4-7 go-vet go-test golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-13 21:57:50 +02:00
Richard Palethorpe
0245b33eab feat(realtime): Add Liquid Audio s2s model and assistant mode on talk page (#9801)
* feat(liquid-audio): add LFM2.5-Audio any-to-any backend + realtime_audio usecase

Wires LiquidAI's LFM2.5-Audio-1.5B as a self-contained Realtime API model:
single engine handles VAD, transcription, LLM, and TTS in one bidirectional
stream — drop-in alternative to a VAD+STT+LLM+TTS pipeline.

Backend
- backend/python/liquid-audio/ — new Python gRPC backend wrapping the
  `liquid-audio` package. Modes: chat / asr / tts / s2s, voice presets,
  Load/Predict/PredictStream/AudioTranscription/TTS/VAD/AudioToAudioStream/
  Free and StartFineTune/FineTuneProgress/StopFineTune. Runtime monkey-patch
  on `liquid_audio.utils.snapshot_download` so absolute local paths from
  LocalAI's gallery resolve without a HF round-trip. soundfile in place of
  torchaudio.load/save (torchcodec drags NVIDIA NPP we don't bundle).
- backend/backend.proto + pkg/grpc/{backend,client,server,base,embed,
  interface}.go — new AudioToAudioStream RPC mirroring AudioTransformStream
  (config/frame/control oneof in; typed event+pcm+meta out).
- core/services/nodes/{health_mock,inflight}_test.go — add stubs for the
  new RPC to the test fakes.

Config + capabilities
- core/config/backend_capabilities.go — UsecaseRealtimeAudio, MethodAudio
  ToAudioStream, UsecaseInfoMap entry, liquid-audio BackendCapability row.
- core/config/model_config.go — FLAG_REALTIME_AUDIO bitmask, ModalityGroups
  membership in both speech-input and audio-output groups so a lone flag
  still reads as multimodal, GetAllModelConfigUsecases entry, GuessUsecases
  branch.

Realtime endpoint
- core/http/endpoints/openai/realtime.go — extract prepareRealtimeConfig()
  so the gate is unit-testable; accept realtime_audio models and self-fill
  empty pipeline slots with the model's own name (user-pinned slots win).
- core/http/endpoints/openai/realtime_gate_test.go — six specs covering nil
  cfg, empty pipeline, legacy pipeline, self-contained realtime_audio,
  user-pinned VAD slot, and partial legacy pipeline.

UI + endpoints
- core/http/routes/ui.go — /api/pipeline-models accepts either a legacy
  VAD+STT+LLM+TTS pipeline or a realtime_audio model; surfaces a
  self_contained flag so the Talk page can collapse the four cards.
- core/http/routes/ui_api.go — realtime_audio in usecaseFilters.
- core/http/routes/ui_pipeline_models_test.go — covers both code paths.
- core/http/react-ui/src/pages/Talk.jsx — self-contained badge instead of
  the four-slot grid; rename Edit Pipeline → Edit Model Config; less
  pipeline-specific wording.
- core/http/react-ui/src/pages/Models.jsx + locales/en/models.json — new
  realtime_audio filter button + i18n.
- core/http/react-ui/src/utils/capabilities.js — CAP_REALTIME_AUDIO.
- core/http/react-ui/src/pages/FineTune.jsx — voice + validation-dataset
  fields, surfaced when backend === liquid-audio, plumbed via
  extra_options on submit/export/import.

Gallery + importer
- gallery/liquid-audio.yaml — config template with known_usecases:
  [realtime_audio, chat, tts, transcript, vad].
- gallery/index.yaml — four model entries (realtime/chat/asr/tts) keyed by
  mode option. Fixed pre-existing `transcribe` typo on the asr entry
  (loader silently dropped the unknown string → entry never surfaced as a
  transcript model).
- gallery/lfm.yaml — function block for the LFM2 Pythonic tool-call format
  `<|tool_call_start|>[name(k="v")]<|tool_call_end|>` matching
  common_chat_params_init_lfm2 in vendored llama.cpp.
- core/gallery/importers/{liquid-audio,liquid-audio_test}.go — detector
  matches LFM2-Audio HF repos (excludes -gguf mirrors); mode/voice
  preferences plumbed through to options.
- core/gallery/importers/importers.go — register LiquidAudioImporter
  before LlamaCPPImporter.
- pkg/functions/parse_lfm2_test.go — seven specs for the response/argument
  regex pair on the LFM2 pythonic format.

Build matrix
- .github/backend-matrix.yml — seven liquid-audio targets (cuda12, cuda13,
  l4t-cuda-13, hipblas, intel, cpu amd64, cpu arm64). Jetpack r36 cuda-12
  is skipped (Ubuntu 22.04 / Python 3.10 incompatible with liquid-audio's
  3.12 floor).
- backend/index.yaml — anchor + 13 image entries.
- Makefile — .NOTPARALLEL, prepare-test-extra, test-extra,
  docker-build-liquid-audio.

Docs
- .agents/plans/liquid-audio-integration.md — phased plan; PR-D (real
  any-to-any wiring via AudioToAudioStream), PR-E (mid-audio tool-call
  detector), PR-G (GGUF entries once upstream llama.cpp PR #18641 lands)
  remain.
- .agents/api-endpoints-and-auth.md — expand the capability-surface
  checklist with every place a new FLAG_* needs to be registered.

Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(realtime): function calling + history cap for any-to-any models

Three pieces, all on the realtime_audio path that just landed:

1. liquid-audio backend (backend/python/liquid-audio/backend.py):
   - _build_chat_state grows a `tools_prelude` arg.
   - new _render_tools_prelude parses request.Tools (the OpenAI Chat
     Completions function array realtime.go already serialises) and
     emits an LFM2 `<|tool_list_start|>…<|tool_list_end|>` system turn
     ahead of the user history. Mirrors gallery/lfm.yaml's `function:`
     template so the model sees the same prompt shape whether served
     via llama-cpp or here. Without this the backend silently dropped
     tools — function calling was wired end-to-end on the Go side but
     the model never saw a tool list.

2. Realtime history cap (core/http/endpoints/openai/realtime.go):
   - Session grows MaxHistoryItems int; default picked by new
     defaultMaxHistoryItems(cfg) — 6 for realtime_audio models (LFM2.5
     1.5B degrades quickly past a handful of turns), 0/unlimited for
     legacy pipelines composing larger LLMs.
   - triggerResponse runs conv.Items through trimRealtimeItems before
     building conversationHistory. Helper walks the cut left if it
     would orphan a function_call_output, so tool result + call pairs
     stay intact.
   - realtime_gate_test.go: specs for defaultMaxHistoryItems and
     trimRealtimeItems (zero cap, under cap, over cap, tool-call pair
     preservation).

3. Talk page (core/http/react-ui/src/pages/Talk.jsx):
   - Reuses the chat page's MCP plumbing — useMCPClient hook,
     ClientMCPDropdown component, same auto-connect/disconnect effect
     pattern. No bespoke tool registry, no new REST endpoints; tools
     come from whichever MCP servers the user toggles on, exactly as
     on the chat page.
   - sendSessionUpdate now passes session.tools=getToolsForLLM(); the
     update re-fires when the active server set changes mid-session.
   - New response.function_call_arguments.done handler executes via
     the hook's executeTool (which round-trips through the MCP client
     SDK), then replies with conversation.item.create
     {type:function_call_output} + response.create so the model
     completes its turn with the tool output. Mirrors chat's
     client-side agentic loop, translated to the realtime wire shape.

UI changes require a LocalAI image rebuild (Dockerfile:308-313 bakes
react-ui/dist into the runtime image). Backend.py changes can be
swapped live in /backends/<id>/backend.py + /backend/shutdown.

Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(realtime): LocalAI Assistant ("Manage Mode") for the Talk page

Mirrors the chat-page metadata.localai_assistant flow so users can ask the
realtime model what's loaded / installed / configured. Tools are run
server-side via the same in-process MCP holder that powers the chat
modality — no transport switch, no proxy, no new wire protocol.

Wire:
- core/http/endpoints/openai/realtime.go:
  - RealtimeSessionOptions{LocalAIAssistant,IsAdmin}; isCurrentUserAdmin
    helper mirrors chat.go's requireAssistantAccess (no-op when auth
    disabled, else requires auth.RoleAdmin).
  - Session grows AssistantExecutor mcpTools.ToolExecutor.
  - runRealtimeSession, when opts.LocalAIAssistant is set: gate on admin,
    fail closed if DisableLocalAIAssistant or the holder has no tools,
    DiscoverTools and inject into session.Tools, prepend
    holder.SystemPrompt() to instructions.
  - Tool-call dispatch loop: when AssistantExecutor.IsTool(name), run
    ExecuteTool inproc, append a FunctionCallOutput to conv.Items, skip
    the function_call_arguments client emit (the client can't execute
    these — it doesn't know about them). After the loop, if any
    assistant tool ran, trigger another response so the model speaks the
    result. Mirrors chat's agentic loop, driven server-side rather than
    via client round-trip.

- core/http/endpoints/openai/realtime_webrtc.go: RealtimeCallRequest
  gains `localai_assistant` (JSON omitempty). Handshake calls
  isCurrentUserAdmin and builds RealtimeSessionOptions.

- core/http/react-ui/src/pages/Talk.jsx: admin-only "Manage Mode"
  checkbox under the Tools dropdown; passes localai_assistant: true to
  realtimeApi.call's body, captured in the connect callback's deps.

Mirroring chat's pattern means the in-process MCP tools surface "just
works" for the Talk page without exposing a Streamable-HTTP MCP endpoint
(which was the alternative). Clients with their own MCP servers can
still use the existing ClientMCPDropdown path in parallel; the realtime
handler distinguishes them by AssistantExecutor.IsTool() at dispatch
time.

Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(realtime): render Manage Mode tool calls in the Talk transcript

Previously the realtime endpoint only emitted response.output_item.added
for the FunctionCall item, and Talk.jsx's switch ignored the event — so
server-side tool runs were invisible in the UI. The model would speak
the result but the user had no way to see what tool was actually
called.

realtime.go: after executing an assistant tool inproc, emit a second
output_item.added/.done pair for the FunctionCallOutput item. Mirrors
the way the chat page displays tool_call + tool_result blocks.

Talk.jsx: handle both response.output_item.added and .done. Render
FunctionCall (with arguments) and FunctionCallOutput (pretty-printed
JSON when possible) as two transcript entries — `tool_call` with the
wrench icon, `tool_result` with the clipboard icon, both in mono-space
secondary-colour. Resets streamingRef after the result so the next
assistant text delta starts a fresh transcript entry instead of
appending to the previous turn.

Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* refactor(realtime): bound the Manage Mode tool-loop + preserve assistant tools

Fallout from a review pass on the Manage Mode patches:

- Bound the server-side agentic loop. triggerResponse used to recurse on
  executedAssistantTool with no cap — a model that kept calling tools
  would blow the goroutine stack. New maxAssistantToolTurns = 10 (mirrors
  useChat.js's maxToolTurns). Public triggerResponse is now a thin shim
  over triggerResponseAtTurn(toolTurn int); recursion increments the
  counter and stops at the cap with an xlog.Warn.

- Preserve Manage Mode tools across client session.update. The handler
  used to blindly overwrite session.Tools, so toggling a client MCP
  server mid-session silently wiped the in-process admin tools. Session
  now caches the original AssistantTools slice at session creation and
  the session.update handler merges them back in (client names win on
  collision — the client is explicit).

- strconv.ParseBool for the localai_assistant query param instead of
  hand-rolled "1" || "true". Mirrors LocalAIAssistantFromMetadata.

- Talk.jsx: render both tool_call and tool_result on
  response.output_item.done instead of splitting them across .added and
  .done. The server's event pairing (added → done) stays correct; the
  UI just doesn't need to inspect both phases of the same item. One
  switch case instead of two, no behavioural change.

Out of scope (noted for follow-ups): extract a shared assistant-tools
helper between chat.go and realtime.go (duplication is small enough
that two parallel implementations stay readable for now), and an i18n
key for the Manage Mode helper text (Talk.jsx doesn't use i18n
anywhere else yet).

Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* ci(test-extra): wire liquid-audio backend smoke test

The backend ships test.py + a `make test` target and is listed in
backend-matrix.yml, so scripts/changed-backends.js already writes a
`liquid-audio=true|false` output when files under backend/python/liquid-audio/
change. The workflow just wasn't reading it.

- Expose the `liquid-audio` output on the detect-changes job
- Add a tests-liquid-audio job that runs `make` + `make test` in
  backend/python/liquid-audio, gated on the per-backend detect flag

The smoke covers Health() and LoadModel(mode:finetune); fine-tune mode
short-circuits before any HuggingFace download (backend.py:192), so the
job needs neither weights nor a GPU. The full-inference path remains
gated on LIQUID_AUDIO_MODEL_ID, which CI doesn't set.

The four new Go test files (core/gallery/importers/liquid-audio_test.go,
core/http/endpoints/openai/realtime_gate_test.go,
core/http/routes/ui_pipeline_models_test.go, pkg/functions/parse_lfm2_test.go)
are already picked up by the existing test.yml workflow via `make test` →
`ginkgo -r ./pkg/... ./core/...`; their packages all carry RunSpecs entries.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-13 21:57:27 +02:00
Andreas Egli
a2940e5d47 feat: also parse VRAM budget/usage from vulkaninfo (#9800)
Signed-off-by: Andreas Egli <github@kharan.ch>
2026-05-13 21:43:12 +02:00
LocalAI [bot]
a645c1f4aa chore: ⬆️ Update ggml-org/llama.cpp to a9883db8ee021cf16783016a60996d41820b5195 (#9796)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 21:40:31 +02:00
LocalAI [bot]
957619af53 chore: ⬆️ Update ikawrakow/ik_llama.cpp to f9a93c37e2fc021760c3c1aa99cf74c73b7591a7 (#9795)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 00:40:48 +02:00
LocalAI [bot]
ad0ab37230 docs: ⬆️ update docs version mudler/LocalAI (#9792)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 00:40:37 +02:00
LocalAI [bot]
0b81e36504 chore: ⬆️ Update antirez/ds4 to f8b4ed635d559b3a5b44bf2df6a77e21b3e9178f (#9794)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 00:40:09 +02:00
LocalAI [bot]
602866a9d8 chore: ⬆️ Update ggml-org/whisper.cpp to 338cce1e58133261753243802a0e7a430118866d (#9793)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-13 00:39:57 +02:00
Ettore Di Giacinto
8521af145f ci(merge): source per-arch digests from ci-cache, not local-ai-backends
Follow-up to PR #9781. v4.2.2 (run 25745181433) showed the keepalive
anchor in ci-cache wasn't enough on its own: 19 of 37 multiarch merges
still failed with "manifest not found" for the same digests we'd just
anchored.

Quay's manifest GC is per-repository. The anchor tag in ci-cache
protects the manifest copy that lives in ci-cache, but the same digest
in local-ai-backends is independently tracked and gets reaped because
nothing in local-ai-backends references it (push-by-digest=true leaves
it untagged). The merge then asks
`local-ai-backends@sha256:<digest>` and quay correctly says "not found"
in that repo, even though `ci-cache@sha256:<digest>` is alive and well.

Empirical confirmation against a live failed digest from v4.2.2:

  $ docker buildx imagetools inspect quay.io/go-skynet/ci-cache@sha256:05377fe6...
  Name: quay.io/go-skynet/ci-cache@sha256:05377fe6...
  MediaType: application/vnd.docker.distribution.manifest.v2+json
  $ docker buildx imagetools inspect quay.io/go-skynet/local-ai-backends@sha256:05377fe6...
  ERROR: ... not found

Switch the source of the quay merge step to ci-cache. The blobs the
manifest references are already accessible from local-ai-backends
(verified via direct registry HEAD: HTTP 200 from both repos — the
original push cross-mounted blobs at content-addressable storage time
and they outlive the per-repo manifest GC). buildx imagetools create
republishes the manifest into local-ai-backends, then writes the
user-facing manifest list pointing at it. End state is self-contained:
the published manifest list references child manifests by digest only,
no embedded reference to ci-cache.

Dockerhub merge step is unchanged. Dockerhub's GC isn't aggressive
enough to reap untagged manifests at the timescales we operate on
(verified: localai/localai-backends@<same digest> still resolves cleanly
after >24h).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-12 22:35:55 +00:00
LocalAI [bot]
bc4cd3dd85 feat(llama-cpp): bump to 1ec7ba0c, adapt grpc-server, expose new spec-decoding options (#9765)
* chore(llama.cpp): bump to 1ec7ba0c14f33f17e980daeeda5f35b225d41994

Picks up the upstream `spec : parallel drafting support` change
(ggml-org/llama.cpp#22838) which reshapes the speculative-decoding API
and `server_context_impl`.

Adapt the grpc-server wrapper accordingly:

  * `common_params_speculative::type` (single enum) became `types`
    (`std::vector<common_speculative_type>`). Update both the
    "default to draft when a draft model is set" branch and the
    `spec_type`/`speculative_type` option parser. The parser now also
    tolerates comma-separated lists, mirroring the upstream
    `common_speculative_types_from_names` semantics.
  * `common_params_speculative_draft::n_ctx` is gone (draft now shares
    the target context size). Keep the `draft_ctx_size` option name for
    backward compatibility and ignore the value rather than failing.
  * `server_context_impl::model` was renamed to `model_tgt`; update the
    two reranker / model-metadata call sites.

Replaces #9763. Builds cleanly under the linux/amd64 cpu-llama-cpp
target locally.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): expose new speculative-decoding option keys

Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838)
adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative
families and beefs up the draft-model knobs. The previous bump only
adapted the API; this exposes the new fields through the grpc-server
options dictionary so model configs can drive them.

New `options:` keys (all under `backend: llama-cpp`):

ngram_mod (`ngram_mod` type):
  spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match

ngram_map_k (`ngram_map_k` type):
  spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits

ngram_map_k4v (`ngram_map_k4v` type):
  spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m /
  spec_ngram_map_k4v_min_hits

ngram lookup caches (`ngram_cache` type):
  spec_lookup_cache_static / lookup_cache_static
  spec_lookup_cache_dynamic / lookup_cache_dynamic

Draft-model tuning (active when `spec_type` is `draft`):
  draft_cache_type_k / spec_draft_cache_type_k
  draft_cache_type_v / spec_draft_cache_type_v
  draft_threads / spec_draft_threads
  draft_threads_batch / spec_draft_threads_batch
  draft_cpu_moe / spec_draft_cpu_moe          (bool flag)
  draft_n_cpu_moe / spec_draft_n_cpu_moe      (first N MoE layers on CPU)
  draft_override_tensor / spec_draft_override_tensor
    (comma-separated <tensor regex>=<buffer type>; re-implements upstream's
     static parse_tensor_buffer_overrides since it isn't exported)

`spec_type` already accepted comma-separated lists after the previous
commit, matching upstream's `common_speculative_types_from_names`.

Docs: refresh `docs/content/advanced/model-configuration.md` with
per-family tables and a note about multi-type chaining.

Builds locally with `make docker-build-llama-cpp` (linux/amd64
cpu-llama-cpp AVX variant).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(turboquant): bridge new llama.cpp spec API to the legacy fork layout

The previous commits in this series adapted backend/cpp/llama-cpp/grpc-server.cpp
to the post-#22838 (parallel drafting) llama.cpp API. The turboquant build
reuses the same grpc-server.cpp through backend/cpp/turboquant/Makefile,
which copies it into turboquant-<flavor>-build/ and runs patch-grpc-server.sh
on the copy. The fork branched before the API refactor, so it errors out on:

  * `ctx_server.impl->model_tgt` (fork still has `model`)
  * `params.speculative.{ngram_mod,ngram_map_k,ngram_map_k4v,ngram_cache}.*`
    (none of these sub-structs exist in the fork)
  * `params.speculative.draft.{cache_type_k/v, cpuparams[, _batch].n_threads,
    tensor_buft_overrides}` (fork uses the pre-#22397 flat layout)
  * `params.speculative.types` vector / `common_speculative_types_from_names`
    (fork has a scalar `type` and only the singular helper)

Approach:

1. backend/cpp/llama-cpp/grpc-server.cpp: introduce a single feature switch
   `LOCALAI_LEGACY_LLAMA_CPP_SPEC`. When defined, the two `speculative.type[s]`
   discriminations (the "default to draft when a draft model is set" branch
   and the `spec_type` / `speculative_type` option parser) fall back to the
   singular scalar form, and the entire new-option block (ngram_mod / map_k
   / map_k4v / ngram_cache / draft.{cache_type_*, cpuparams*,
   tensor_buft_overrides}) is preprocessed out. The macro is *not* defined
   in the source tree — stock llama-cpp builds get the full new API.

2. backend/cpp/turboquant/patch-grpc-server.sh: two new patch steps applied
   to the per-flavor build copy at turboquant-<flavor>-build/grpc-server.cpp:
   - substitute `ctx_server.impl->model_tgt` -> `ctx_server.impl->model`
   - inject `#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1` before the first
     `#include`, so the guarded blocks above drop out for the fork build.

   Both patches are idempotent and follow the existing sed/awk pattern in
   this script (KV cache types, `get_media_marker`, flat speculative
   renames). Stock llama-cpp's `grpc-server.cpp` is never touched.

Drop both legacy patches once the turboquant fork rebases past
ggml-org/llama.cpp#22397 / #22838.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(turboquant): close draft_ctx_size brace inside legacy guard

The previous turboquant fix wrapped the new option-handler blocks in
`#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC ... #endif` but placed the guard
in the middle of an `else if` chain — the `} else if` openings of the
new blocks were responsible for closing the previous block's brace.
With the macro defined the new blocks vanish, draft_ctx_size's `{`
loses its closer, the for-loop's `}` is consumed instead, and the
file ends with a stray opening brace — clang reports it as
`function-definition is not allowed here before '{'` on the next
top-level `int main(...)` and `expected '}' at end of input`.

Move the chain split inside the draft_ctx_size branch:

    } else if (... "draft_ctx_size") {
        // ...
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
    }                                  // legacy: chain ends here
#else
    } else if (... "spec_ngram_mod_n_min") {  // modern: chain continues
        ...
    } else if (... "draft_override_tensor") {
        ...
    }                                  // closes last branch
#endif
    }                                  // closes for-loop

Brace count is now balanced under both preprocessor branches (verified
with `tr -cd '{' | wc -c` against the patched and unpatched outputs).

Local `make docker-build-turboquant` builds the linux/amd64 cpu-llama-cpp
`turboquant-avx` variant cleanly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ci): forward AMDGPU_TARGETS into Dockerfile.turboquant builder-prebuilt

Dockerfile.turboquant's `builder-prebuilt` stage was missing the
`ARG AMDGPU_TARGETS` / `ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}` pair that
`builder-fromsource` already has (and that `Dockerfile.llama-cpp`
mirrors across both stages). When CI uses the prebuilt base image
(quay.io/go-skynet/ci-cache:base-grpc-*, the common path) the build-arg
passed by the workflow never reaches the env inside the compile stage.

backend/cpp/llama-cpp/Makefile:38 (introduced by #9626) errors out on
hipblas builds when AMDGPU_TARGETS is empty, and the turboquant
Makefile reuses backend/cpp/llama-cpp via a sibling build dir, so the
same check fires from turboquant-fallback under BUILD_TYPE=hipblas:

  Makefile:38: *** AMDGPU_TARGETS is empty — set it to a comma-separated
  list of gfx targets e.g. gfx1100,gfx1101.  Stop.
  make: *** [Makefile:66: turboquant-fallback] Error 2

The bug is latent on master because the docker layer cache stays warm
across builds — the compile step rarely re-runs from scratch. The
llama.cpp bump in this PR invalidates the cache, so the missing env var
becomes load-bearing and the hipblas turboquant CI job fails.

Mirror the existing pattern from Dockerfile.llama-cpp.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-12 17:22:37 +02:00
LocalAI [bot]
86a7f6c9fa ci: close GC race + cascade-skip + darwin grpc gaps from v4.2.1 (#9781)
* ci: close the GC race + cascade-skip + darwin grpc gaps from v4.2.1

v4.2.1's backend.yml run (#25701862853) exposed three independent issues
on top of the singletons fix shipped in ea001995. Address all three plus
two related cleanups:

1. quay GC race in backend-merge-jobs-multiarch (12/37 merges failed with
   "manifest not found"). Even after PR #9746 split multi/single-arch
   merges, the multiarch matrix itself takes ~2h to drain at
   max-parallel: 8, and the earliest per-arch digests (push-by-digest,
   no tag) get reaped by quay's GC before the merge runs. The split
   bounded the race for multiarch; it doesn't eliminate it. Anchor each
   per-arch digest immediately to a tag in the internal ci-cache image
   (`keepalive-<run_id><tag-suffix>-<platform-tag>`). Quay won't GC
   tagged manifests. backend_merge.yml deletes the keepalive tags via
   quay REST API after publishing the user-facing manifest list.
   Cleanup is best-effort: if the quay token is not OAuth-scoped the
   merge does NOT fail, the orphan tags just persist.

2. cascade-skip on backend-merge-jobs-singlearch. v4.2.1 had 2 failed
   and 2 cancelled singlearch builds (out of 199); GHA's default
   `needs:` semantics cascade-skipped the entire singlearch merge
   matrix, so zero singleton tags were applied even though 197
   singletons built successfully. Wrap the merge `if:` in
   `!cancelled() && ...` for both multi and single arch in backend.yml
   and backend_pr.yml so partial build failures publish the successful
   tag-suffixes.

3. Darwin llama-cpp grpc-server build fails with `find_package(absl)`
   not found. Same shape as the ccache/blake3/fmt/hiredis/xxhash/zstd
   fix already in `Dependencies`: a brew cache hit restores
   `/opt/homebrew/Cellar/grpc` so `brew install grpc` no-ops, but
   abseil isn't in our Cellar cache list and never gets installed
   alongside, leaving grpc's CMake unable to resolve it. Mirror the
   `brew reinstall ccache` line with `brew reinstall grpc` to
   re-validate grpc's full transitive dep closure on every cache-hit
   run.

4. Move the four heaviest CUDA cpp builds back to bigger-runner. v4.2.1
   wall-clock: -gpu-nvidia-cuda-12-llama-cpp 5h36m,
   -gpu-nvidia-cuda-12-turboquant 6h05m,
   -gpu-nvidia-cuda-13-llama-cpp 5h37m,
   -gpu-nvidia-cuda-13-turboquant 6h05m. The cuda-12 turboquant and
   cuda-13 turboquant entries are over GHA's 6h job timeout. Phase 5.3
   of the free-tier migration (PR #9730) had explicitly flagged this
   batch as 'highest-risk' with a per-entry revert path. All other
   matrix entries (vulkan-llama-cpp ~47m, ROCm hipblas-llama-cpp ~2h,
   intel sycl-f32 ~1h49m) stay on free-tier ubuntu-latest.

Verified locally: all six edited workflow YAMLs parse cleanly. Real
verification has to come from the next tag release run.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: extract keepalive anchor + cleanup into .github/scripts/

The two inline shell blocks from the previous commit are long enough to
hurt readability of the workflow YAML and benefit from their own files
with self-contained docs. Move them to .github/scripts/:

  anchor-digest-in-cache.sh    backend_build.yml's keepalive anchor
  cleanup-keepalive-tags.sh    backend_merge.yml's best-effort cleanup

Workflow steps reduce to a single `run:` invocation each, with all the
parameter plumbing handled by env vars on the step. backend_merge.yml
also gains a sparse `actions/checkout@v6` step (sparse to .github/scripts
only) so the cleanup script is available on the runner — backend_build
already checks out for the docker build.

Net workflow diff: -36 lines across the two files. Script logic and
behavior are byte-identical to the inline version.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-12 17:22:09 +02:00
LocalAI [bot]
a57e73691d fix(ollama): accept prompt alias on /api/embed for Ollama parity (#9780)
Ollama's embedding endpoint accepts both `input` and `prompt` as the
input string value (see ollama/ollama docs/api.md#generate-embeddings).
LocalAI only accepted `input`, which broke client libraries that send
the `prompt` form.

Add `Prompt` to OllamaEmbedRequest and have GetInputStrings fall back
to it when Input is unset. Input still wins when both are provided.

Fixes #9767.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-12 17:21:20 +02:00
dependabot[bot]
a689100d61 chore(deps): bump the npm_and_yarn group across 1 directory with 3 updates (#9728)
Bumps the npm_and_yarn group with 3 updates in the /core/http/react-ui directory: [fast-uri](https://github.com/fastify/fast-uri), [hono](https://github.com/honojs/hono) and [ip-address](https://github.com/beaugunderson/ip-address).


Updates `fast-uri` from 3.1.0 to 3.1.2
- [Release notes](https://github.com/fastify/fast-uri/releases)
- [Commits](https://github.com/fastify/fast-uri/compare/v3.1.0...v3.1.2)

Updates `hono` from 4.12.14 to 4.12.18
- [Release notes](https://github.com/honojs/hono/releases)
- [Commits](https://github.com/honojs/hono/compare/v4.12.14...v4.12.18)

Updates `ip-address` from 10.1.0 to 10.2.0
- [Commits](https://github.com/beaugunderson/ip-address/commits)

---
updated-dependencies:
- dependency-name: fast-uri
  dependency-version: 3.1.2
  dependency-type: indirect
  dependency-group: npm_and_yarn
- dependency-name: hono
  dependency-version: 4.12.18
  dependency-type: indirect
  dependency-group: npm_and_yarn
- dependency-name: ip-address
  dependency-version: 10.2.0
  dependency-type: indirect
  dependency-group: npm_and_yarn
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:54:38 +02:00
Andreas Egli
03815e3b59 fix: parse vulkan VRAM from text (#9669)
* fix: parse vulkan VRAM from text

Assisted-by: opencode:gpt-5.5
Signed-off-by: Andreas Egli <github@kharan.ch>

* fix: replace string.split with streaming iteration

Assisted-by: Opencode:Gemma4
Signed-off-by: Andreas Egli <github@kharan.ch>

---------

Signed-off-by: Andreas Egli <github@kharan.ch>
2026-05-12 09:53:48 +02:00
dependabot[bot]
37991c8a18 chore(deps): bump github.com/mudler/edgevpn from 0.31.1 to 0.32.2 (#9773)
Bumps [github.com/mudler/edgevpn](https://github.com/mudler/edgevpn) from 0.31.1 to 0.32.2.
- [Release notes](https://github.com/mudler/edgevpn/releases)
- [Commits](https://github.com/mudler/edgevpn/compare/v0.31.1...v0.32.2)

---
updated-dependencies:
- dependency-name: github.com/mudler/edgevpn
  dependency-version: 0.32.2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:51:39 +02:00
dependabot[bot]
61c9b187fa chore(deps): update charset-normalizer requirement from >=3.4.0 to >=3.4.7 in /backend/python/vllm (#9779)
chore(deps): update charset-normalizer requirement

Updates the requirements on [charset-normalizer](https://github.com/jawah/charset_normalizer) to permit the latest version.
- [Release notes](https://github.com/jawah/charset_normalizer/releases)
- [Changelog](https://github.com/jawah/charset_normalizer/blob/master/CHANGELOG.md)
- [Commits](https://github.com/jawah/charset_normalizer/compare/3.4.0...3.4.7)

---
updated-dependencies:
- dependency-name: charset-normalizer
  dependency-version: 3.4.7
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:22:23 +02:00
dependabot[bot]
c66014312e chore(deps): bump github.com/fsnotify/fsnotify from 1.9.0 to 1.10.1 (#9778)
Bumps [github.com/fsnotify/fsnotify](https://github.com/fsnotify/fsnotify) from 1.9.0 to 1.10.1.
- [Release notes](https://github.com/fsnotify/fsnotify/releases)
- [Changelog](https://github.com/fsnotify/fsnotify/blob/main/CHANGELOG.md)
- [Commits](https://github.com/fsnotify/fsnotify/compare/v1.9.0...v1.10.1)

---
updated-dependencies:
- dependency-name: github.com/fsnotify/fsnotify
  dependency-version: 1.10.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:21:18 +02:00
dependabot[bot]
abc2a51641 chore(deps): update transformers requirement from >=5.0.0 to >=5.8.0 in /backend/python/transformers (#9775)
chore(deps): update transformers requirement

Updates the requirements on [transformers](https://github.com/huggingface/transformers) to permit the latest version.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v5.0.0...v5.8.0)

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 5.8.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:21:05 +02:00
dependabot[bot]
cd7d163178 chore(deps): bump github.com/onsi/gomega from 1.39.1 to 1.40.0 (#9774)
Bumps [github.com/onsi/gomega](https://github.com/onsi/gomega) from 1.39.1 to 1.40.0.
- [Release notes](https://github.com/onsi/gomega/releases)
- [Changelog](https://github.com/onsi/gomega/blob/master/CHANGELOG.md)
- [Commits](https://github.com/onsi/gomega/compare/v1.39.1...v1.40.0)

---
updated-dependencies:
- dependency-name: github.com/onsi/gomega
  dependency-version: 1.40.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:20:36 +02:00
dependabot[bot]
7aac599deb chore(deps): bump github.com/anthropics/anthropic-sdk-go from 1.27.0 to 1.42.0 (#9772)
chore(deps): bump github.com/anthropics/anthropic-sdk-go

Bumps [github.com/anthropics/anthropic-sdk-go](https://github.com/anthropics/anthropic-sdk-go) from 1.27.0 to 1.42.0.
- [Release notes](https://github.com/anthropics/anthropic-sdk-go/releases)
- [Changelog](https://github.com/anthropics/anthropic-sdk-go/blob/main/CHANGELOG.md)
- [Commits](https://github.com/anthropics/anthropic-sdk-go/compare/v1.27.0...v1.42.0)

---
updated-dependencies:
- dependency-name: github.com/anthropics/anthropic-sdk-go
  dependency-version: 1.42.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:20:24 +02:00
dependabot[bot]
d75173dd2a chore(deps): bump actions/download-artifact from 4 to 8 (#9771)
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 4 to 8.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v4...v8)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-version: '8'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:20:14 +02:00
dependabot[bot]
9be5310394 chore(deps): bump actions/upload-artifact from 4 to 7 (#9770)
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 7.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v7)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:20:03 +02:00
dependabot[bot]
cdf50fd723 chore(deps): bump node from 25-slim to 26-slim (#9769)
Bumps node from 25-slim to 26-slim.

---
updated-dependencies:
- dependency-name: node
  dependency-version: 26-slim
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-12 09:19:51 +02:00
LocalAI [bot]
bc3fb16105 feat(ollama): report model capabilities + details on /api/tags and /api/show (#9766)
Ollama-compatible clients (Open WebUI, Enchanted, ollama-grid-search,
etc.) rely on the `capabilities` list and `details.{parameter_size,
quantization_level,families}` fields returned by /api/tags and
/api/show to decide which models are eligible for a given task --
for example to filter the "embedding model" picker. Upstream Ollama
returns these; LocalAI's compat layer was leaving them empty, so
embedding models were silently rejected by clients that only allow
chat models for chat and only allow embedding models for embeddings.

This wires up the existing config signals already present in
ModelConfig:

- modelCapabilities() derives the Ollama capability strings from the
  config: "embedding" (FLAG_EMBEDDINGS), "completion" (FLAG_CHAT /
  FLAG_COMPLETION), "vision" (explicit KnownUsecases bit or MMProj /
  multimodal template / backend media marker), "tools" (auto-detected
  ToolFormatMarkers, JSON/Response regex, XML format, grammar
  triggers), "thinking" (ReasoningConfig with reasoning not disabled)
  and "insert" (presence of a completion template).
- modelDetailsFromModelConfig() now fills families, parameter_size
  and quantization_level. The latter two are parsed from the GGUF
  filename via regex -- conservative tokens only (Q*/IQ*/F16/F32/BF16
  and \d+(\.\d+)?[BM] surrounded by separators) so we don't accidentally
  match "Qwen3" as "3B".
- modelInfoFromModelConfig() exposes general.architecture and
  general.context_length in the new ShowResponse.model_info map.

Note: HasUsecases(FLAG_VISION) cannot be used directly -- GuessUsecases
has no FLAG_VISION case and returns true at the end for any chat model.
hasVisionSupport() instead reads KnownUsecases explicitly plus MMProj /
template / media-marker signals.

Tests are written first (TDD) using Ginkgo/Gomega -- DescribeTable for
the capability mapping (embedding-only, chat, vision, thinking, tools
via markers, tools via JSON regex, no-capability rerank) plus
integration tests against ShowModelEndpoint that round-trip JSON
through a real ModelConfigLoader populated from a temp YAML file.

Fixes #9760.


Assisted-by: Claude Code:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-12 00:16:19 +02:00
LocalAI [bot]
78722caedc chore: ⬆️ Update ikawrakow/ik_llama.cpp to eb570eb96689c235933b813693ca28ab9d3d26de (#9764)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-12 00:02:22 +02:00
LocalAI [bot]
621c612b2d ci(bump-deps): register ds4 + move version pin into the Makefile (#9761)
* ci(bump-deps): register ds4 + move version pin into the Makefile

The initial ds4 PR (#9758) put the upstream commit pin in
backend/cpp/ds4/prepare.sh as a shell variable. The auto-bump bot at
.github/bump_deps.sh greps for ^$VAR?= in a Makefile, so DS4_VERSION
was invisible to it - other backends (llama-cpp, ik-llama-cpp,
turboquant, voxtral, etc.) all pin in their Makefile.

This change:

- Moves DS4_VERSION?= and DS4_REPO?= to the top of
  backend/cpp/ds4/Makefile.
- Inlines the git init/fetch/checkout recipe into the 'ds4:' target
  (matches llama-cpp's 'llama.cpp:' target pattern). Directory acts
  as the target so make only re-clones when missing.
- Deletes the now-redundant prepare.sh.
- Adds antirez/ds4 + DS4_VERSION + main + backend/cpp/ds4/Makefile to
  the .github/workflows/bump_deps.yaml matrix so the daily bot opens
  PRs against this pin.
- Updates .agents/ds4-backend.md to point at the Makefile.

Verified:
  $ grep -m1 '^DS4_VERSION?=' backend/cpp/ds4/Makefile
  DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0
  $ make -C backend/cpp/ds4 ds4   # clones into ds4/ at the pin
  $ make -C backend/cpp/ds4 ds4   # no-op on second invocation
  make: 'ds4' is up to date.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: route backend/cpp/ds4/ changes through changed-backends.js

scripts/changed-backends.js:inferBackendPath has an explicit branch per
cpp dockerfile suffix (ik-llama-cpp, turboquant, llama-cpp). Without a
matching branch the function returns null, the backend never lands in
the path map, and PR change-detection cannot map "backend/cpp/ds4/X
changed" -> "rebuild ds4 image".

This is why PR #9761 produced zero ds4 jobs even though it directly
edits backend/cpp/ds4/Makefile.

Adds the missing branch (Dockerfile.ds4 -> backend/cpp/ds4/), placed
before the llama-cpp branch (since both share the .cpp ancestry but
ds4 is more specific - same ordering rule documented in
.agents/adding-backends.md).

Verified with a local Node simulation of the script against this PR's
diff: the path map now contains 'ds4 -> backend/cpp/ds4/' and a
'backend/cpp/ds4/Makefile' change correctly triggers the ds4 backend
in the rebuild set.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(adding-backends): harden the two gotchas that bit ds4

Both omissions are silent at the time you ADD a backend - the failure
mode only appears later (the bump bot stays silent forever, or the path
filter shows up on the next PR that touches your backend with zero CI
jobs and looks broken for unrelated reasons). Expanding the
`scripts/changed-backends.js` paragraph from a one-liner to a fully
worked example, and adding a new sibling paragraph for the
`bump_deps.yaml` + Makefile-pin contract.

Both call out the specific mistakes from the ds4 timeline (#9758#9761) so future contributors can pattern-match on the cause.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-11 22:46:02 +02:00
LocalAI [bot]
e3f9de1026 docs: ⬆️ update docs version mudler/LocalAI (#9762)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-11 22:37:06 +02:00
LocalAI [bot]
d892e4af80 feat: add ds4 backend (DeepSeek V4 Flash) with tool calls, thinking, KV cache (#9758)
* test(e2e-backends): allow BACKEND_BINARY for native-built backends

Adds an escape hatch for hardware-gated backends (e.g. ds4) where the
model is too large for Docker build context. When BACKEND_BINARY points
at a run.sh produced by 'make -C backend/cpp/<name> package', the suite
skips docker image extraction and drives the binary directly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(e2e-backends): validate BACKEND_BINARY basename + log actual source

Two follow-ups from the cbcf5148 code review:

- BACKEND_BINARY now requires a path whose basename is `run.sh`. Without
  this check, `filepath.Dir(binary)` silently discarded the filename, so
  pointing the env var at an arbitrary binary failed later with a
  confusing assertion that named a path the user never typed.
- The "Testing image=..." debug line printed an empty string when the
  binary path was used, hiding the actual source in CI logs. The line
  now reports whichever of BACKEND_IMAGE / BACKEND_BINARY is in effect
  as `src=...`.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): scaffold ds4 backend dir

Adds prepare.sh, run.sh, and a .gitignore. CMakeLists, Makefile, and the
implementation arrive in follow-up commits.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add backend Makefile

Drives ds4's upstream Makefile to produce engine .o files (CUDA on Linux
when BUILD_TYPE=cublas, Metal on Darwin, otherwise CPU debug path), then
invokes CMake on our wrapper.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add CMakeLists for grpc-server

Generates protoc stubs from backend.proto, links grpc-server.cpp +
dsml_parser.cpp + dsml_renderer.cpp + kv_cache.cpp against pre-built
ds4 engine .o files. DS4_GPU=cuda|metal|cpu selects the backend.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): grpc-server skeleton + module stubs

The minimum that links: Backend service with Health + Free; other RPCs
default to UNIMPLEMENTED. Stub headers/sources for dsml_parser,
dsml_renderer, and kv_cache are in place so CMake links cleanly even
before those modules ship.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement LoadModel

Opens engine + creates session sized to ContextSize (default 32768).
Backend is compile-time: CPU when DS4_NO_GPU, Metal on __APPLE__, else
CUDA. MTP/speculative options are accepted via ModelOptions.Options[]
(mtp_path, mtp_draft, mtp_margin). kv_cache_dir option is captured into
g_kv_cache_dir for the cache module (Task 19 wires it in).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement TokenizeString

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement Predict (plain text)

Tool calls + thinking-mode split arrive in Task 13 once dsml_parser is in.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement PredictStream (plain text)

ChatDelta + reasoning/tool_calls split arrives in Task 14.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): implement Status RPC

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add DSML streaming parser

Classifies raw model-emitted token text into CONTENT / REASONING /
TOOL_START / TOOL_ARGS / TOOL_END events. Markers it watches for are the
literal DSML strings rendered by ds4_server.c's prompt template
(<|DSML|tool_calls>, <|DSML|invoke name=...>, <think>, etc.) - these are
plain text the model emits, not special tokens.

Partial markers split across token chunks are buffered until a full marker
or a definitively-not-a-marker '<' is observed. RandomToolId() generates
the API-side tool call id (call_xxx) that exact-replay would key on.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): split hex escapes in DSML markers + add cstring/cstdio includes

C++ \x hex escapes have no length cap. '\x9cD' was read as a single escape
producing byte 0xCD, eating the 'D'. The markers were never actually matching
the DSML text the model emits. Split each escape with adjacent string literal
concatenation so the byte sequence is exactly EF BD 9C 44 (|D) at runtime.

Also adds <cstring> and <cstdio> includes (libstdc++ 13 does not transitively
expose std::strlen / std::snprintf via <string>).

The local plan file (uncommitted) was also updated with the same fixes so
Task 16's dsml_renderer.cpp does not re-introduce the bug.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): wire DsmlParser into Predict (ChatDelta)

Non-streaming Predict now emits one ChatDelta carrying content,
reasoning_content, and tool_calls[] parsed from the model's DSML output.
Reply.message still carries the raw model bytes for backends that prefer
the regex fallback path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): wire DsmlParser into PredictStream

Per-token ChatDelta writes: content/reasoning_content go incrementally,
tool_calls emit TOOL_START as one delta (id + name) followed by
TOOL_ARGS deltas with incremental JSON. The Go-side aggregator
(pkg/functions/chat_deltas.go) reassembles them.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): chat template + reasoning_effort mapping

UseTokenizerTemplate=true + Messages -> ds4_chat_begin / append /
assistant_prefix. PredictOptions.Metadata['enable_thinking'] and
['reasoning_effort'] map to ds4_think_mode (DS4_THINK_HIGH default;
'max'/'xhigh' -> DS4_THINK_MAX; disabled -> DS4_THINK_NONE).

Tool-call rendering for assistant turns with tool_calls JSON arrives in
the next commit (dsml_renderer).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): render assistant tool_calls + tool results to DSML

Closes the round-trip: when an OpenAI client sends a multi-turn chat
where prior turns contain tool_calls or role=tool messages, build_prompt
serializes them back to the DSML shape the model was trained on. Mirrors
ds4_server.c's prompt renderer; uses nlohmann::json for parsing the
OpenAI tool_calls payload.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): disk KV cache module

Dir-based cache keyed by SHA1(rendered prompt prefix). File format:
'DS4G' magic + version + ctx_size + prefix_len + prefix + payload_bytes
+ ds4_session_save_payload output. NOT bit-compatible with ds4-server's
KVC files - that interop is a follow-up plan. LoadLongestPrefix walks
the dir picking the longest stored prefix that prefixes the incoming
prompt.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): wire KvCache into Predict/PredictStream

LoadModel reads 'kv_cache_dir' from ModelOptions.Options[], passes it to
g_kv_cache.SetDir. Each Predict/PredictStream computes a render text for
the request, tries LoadLongestPrefix to recover state, then Saves the
new state after generation. ds4_session_sync handles the live-cache
fast path internally, so the disk cache only matters for cold-starts
and cross-session reuse.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): add package.sh

Linux: bundles libc + ld + libstdc++ + libgomp + GPU runtime libs into
package/lib so the FROM scratch image boots without a host libc.
Darwin is handled by scripts/build/ds4-darwin.sh which uses otool -L.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): rename namespace ds4_backend -> ds4cpp

ds4.h defines 'typedef enum {...} ds4_backend' which collides with our
C++ 'namespace ds4_backend' anywhere a TU includes both. kv_cache.h
includes ds4.h directly and surfaces the conflict immediately; other
TUs would hit it once gRPC dev headers are available.

Renames the C++ namespace to ds4cpp across all wrapper files and the
plan, leaving the upstream ds4 typedef untouched.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend): add Dockerfile.ds4

Single-stage builder (CUDA devel image for cublas, ubuntu:24.04 for cpu)
-> FROM scratch with packaged grpc-server + bundled runtime libs.
nlohmann-json3-dev is required for dsml_renderer's JSON handling.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(make): wire backend/cpp/ds4 + ds4-darwin into root Makefile

BACKEND_DS4 entry + generate-docker-build-target eval + docker-build-ds4
in docker-build-backends + .NOTPARALLEL guards. Also adds the
backends/ds4-darwin target which delegates to scripts/build/ds4-darwin.sh
(landed in Task 24).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: add backend-matrix entries for ds4 (cpu + cuda13, per-arch)

Two entries per build (amd64 + arm64) so backend-merge-jobs assembles a
multi-arch manifest. Skipping cuda12 - ds4 was validated against CUDA 13.
Darwin Metal is handled outside this matrix by backend_build_darwin.yml.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/index): add ds4 meta + image entries

cpu + cuda13 x latest + master. Darwin Metal builds publish under
ds4-darwin via the existing llama-cpp-darwin OCI pipeline.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(scripts/build): add ds4-darwin.sh

Native macOS/Metal build for the ds4 backend. Mirrors llama-cpp-darwin.sh:
make grpc-server -> otool -L for dylib bundling -> OCI tar that
'local-ai backends install' consumes via the backends/ds4-darwin
Makefile target.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(darwin): build ds4-darwin in backend_build_darwin

Adds a 'Build ds4 backend (Darwin Metal)' step that runs the
backends/ds4-darwin Makefile target on the macOS runner.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(import): auto-detect ds4 weights via DS4Importer

Adds core/gallery/importers/ds4.go which matches on the antirez/deepseek-v4-gguf
repo URI and the DeepSeek-V4-Flash-*.gguf filename pattern. Registered before
LlamaCPPImporter so ds4 weights route to backend: ds4 instead of falling
through to llama-cpp.

Also lists ds4 in /backends/known so the /import-model UI surfaces it as a
manual choice for users who want to force the backend on a non-canonical URI.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add deepseek-v4-flash-q2 (ds4 backend)

One-click install of the q2 weights with backend: ds4.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(.agents): add ds4-backend.md

Documents the backend shape, DSML state machine, thinking-mode mapping,
disk KV cache, build matrix (cpu/cuda13/Darwin), and the BACKEND_BINARY
hardware-validation path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): pass UBUNTU_VERSION + arch env vars to install-base-deps

The .docker/install-base-deps.sh script needs UBUNTU_VERSION (defaults to
2404), TARGETARCH, SKIP_DRIVERS, and APT_MIRROR/APT_PORTS_MIRROR exported
into the environment so it can pick the right cuda-keyring / cudss / nvpl
debs and apt mirrors. Dockerfile.ds4 was declaring some of the ARGs but not
re-exporting them via ENV. Mirrors Dockerfile.llama-cpp's pattern.

Without this fix 'make docker-build-ds4 BUILD_TYPE=cublas CUDA_MAJOR_VERSION=13'
failed at:
  /usr/local/sbin/install-base-deps: line 120: UBUNTU_VERSION: unbound variable

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/index): add Metal image entries for ds4

Adds metal-ds4 + metal-ds4-development image entries pointing at
quay.io/go-skynet/local-ai-backends:{latest,master}-metal-darwin-arm64-ds4
(built by scripts/build/ds4-darwin.sh on macOS arm64 runners), plus the
'metal' and 'metal-darwin-arm64' capability mappings on the ds4 meta and
ds4-development variant.

Closes a gap from the initial Task 23 landing - the Darwin Metal build
script and CI workflow step were already wired (Tasks 24-25), but the
gallery had no image entry for users to install the Metal variant.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ci): use ubuntu:24.04 base for ds4 cuda13 matrix entries

The initial Task 22 matrix landing used base-image: 'nvidia/cuda:13.0.0-devel-ubuntu24.04'
which clashes with install-base-deps.sh's cuda-keyring step:

  E: Conflicting values set for option Signed-By regarding source
     https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/

The canonical pattern (llama-cpp, ik-llama-cpp, turboquant) uses plain
'ubuntu:24.04' + 'skip-drivers: false' so install-base-deps installs CUDA
from scratch via its own keyring setup. Adopting that here.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): drop install-base-deps.sh dependency

The .docker/install-base-deps.sh pipeline is built around the llama-cpp
needs: NVIDIA keyring + cuda-toolkit apt + gRPC-from-source build at
/opt/grpc. For ds4 we don't need any of that:
- CUDA: nvidia/cuda:13.0.0-devel-ubuntu24.04 ships /usr/local/cuda
  ready to go; install-base-deps's keyring step then conflicts with
  the pre-installed Signed-By.
- gRPC: ds4's grpc-server.cpp only links against grpc++; system
  libgrpc++-dev (apt) is sufficient, no source build needed.

Replaced the install-base-deps invocation in Dockerfile.ds4 with a
direct 'apt-get install libgrpc++-dev libprotobuf-dev protobuf-compiler-grpc
nlohmann-json3-dev cmake build-essential pkg-config git'. Matrix entries
back to nvidia/cuda base + skip-drivers=true so install-base-deps would
no-op even if some downstream tooling calls it.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): correct proto accessors + alias grpc::Status as GStatus

Two compile bugs caught by the docker build:

1. proto::Message uses snake_case accessors. The build_prompt loop called
   m.toolcalls() / m.toolcallid() - the protoc-generated names are
   m.tool_calls() / m.tool_call_id(). Plan-text bug propagated to the
   wrapper.

2. The Status RPC method shadowed the 'using grpc::Status' alias, so any
   later method declaration using Status as a return type failed to parse
   ('Status does not name a type' starting at LoadModel). Solution: alias
   grpc::Status as GStatus instead, with no 'using' clause that would
   conflict. All RPC method declarations and return-statement constructions
   now use GStatus.

Pre-existing code reviewer flagged the Status-shadow concern as 'minor'
in the original Task 10 commit; it turned out to be a real compile blocker
under libstdc++ 13 once the surrounding methods were filled in.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): preserve TOOL_ARGS content in dsml_parser Flush

When the model emitted a parameter value that arrived in the same buffer
as the surrounding tool_call markers (e.g. the buffered tail after a
literal '</think>' opened the model output), the parser deferred all
buffered bytes to Flush() because looks_like_prefix() always returns
true while buf starts with '<'. Flush() then drained the buffer as
plain CONTENT/REASONING regardless of parser state, so the bytes
between the parameter open and close markers were classified as
CONTENT instead of TOOL_ARGS.

Symptom: the model emitted

  <|DSML|parameter name="location" string="true">Paris, France</|DSML|parameter>

and the assembled tool_call arguments came out as {"location":""} -
the opener and closer were emitted into the args stream but the
"Paris, France" content went to the assistant message instead.

Fix:

1. Flush() now uses the same state-aware emit logic as DrainPlain:
   PARAM_VALUE bytes become TOOL_ARGS (json-escaped when string),
   THINK bytes become REASONING, TEXT bytes become CONTENT, and
   INVOKE / TOOL_CALLS structural whitespace is discarded.

2. looks_like_prefix() restricts its leading-'<' fallback to buffers
   that have not yet seen a '>'. Without that change, char-by-char
   feeds would discard the '<' of '<|DSML|invoke name="..."' once
   the marker prefix length was reached but the closing quote/'>'
   were still in flight.

Verified with a standalone harness that runs the failing input three
ways (single Feed, split-after-'>', and char-by-char) and aggregates
TOOL_ARGS for tool index 0: all three now produce
{"location":"Paris, France"}.

Assisted-by: Claude:opus-4.7 [Read,Edit,Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend/cpp/ds4): use ds4_session_sync + manual generation loop for KV persistence

ds4_engine_generate_argmax() is a self-contained helper that doesn't take or
update a ds4_session - it manages its own internal state. Our Predict and
PredictStream methods created g_session via ds4_session_create() but then
called ds4_engine_generate_argmax(), so g_session's KV state never advanced.
ds4_session_payload_bytes(g_session) returned 0 and the disk KV cache save
correctly rejected with 'session has no valid checkpoint to save'.

Switch both RPCs to the proper session API:
  ds4_session_sync(g_session, &prompt, ...)
  loop:
    int token = ds4_session_argmax(g_session)
    if token == eos: break
    emit(token)
    ds4_session_eval(g_session, token, ...)

After the loop the session has a real checkpoint and ds4_session_save_payload
writes the KV state to disk. Verified end-to-end on a DGX Spark GB10: three
.kv files (15-30 MB each) are written when BACKEND_TEST_OPTIONS sets
kv_cache_dir, and the e2e tool-call assertion still passes.

Also added stderr diagnostics to KvCache (enabled/disabled at SetDir; per-save
path + payload_bytes + result) so future failures are visible instead of
silent. The 'wrote ok' lines are low-volume - one per Predict/PredictStream
when the cache is enabled - and skipped entirely when the option is unset.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): use ds4_session_eval_speculative_argmax when MTP loaded

Wires MTP (Multi-Token Prediction) speculative decoding into the manual
generation loop in both Predict and PredictStream. When the upstream MTP
weights are loaded via 'mtp_path:' option AND we're on CUDA / Metal,
ds4_engine_mtp_draft_tokens() returns >0 and we switch the inner loop to
ds4_session_eval_speculative_argmax(), which can accept N>1 tokens per
verifier step. When MTP is not loaded (no option, CPU backend, or weights
absent), we fall through to the simple ds4_session_argmax + ds4_session_eval
path with no behavior change.

Validated on a DGX Spark GB10 with the optional MTP GGUF
(DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf, ~3.6 GB). LoadModel logs
'ds4: MTP support model loaded ... (draft=2)' on stderr.

Caveat per upstream README: 'currently provides at most a slight speedup,
not a meaningful generation-speed win'. Wired now mainly to track the
upstream API; bigger speedups arrive when ds4 improves the speculative path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend/cpp/ds4): honor PredictOptions sampling with DSML-aware override

Mirrors ds4_server.c:7102-7115 sampling-policy semantics on the LocalAI
gRPC side. The generation loop now consults compute_sample_params() per
token to pick the effective (temperature, top_k, top_p, min_p), based on:

  1. Request defaults: PredictOptions.temperature / .topk / .topp / .minp
  2. Thinking-mode override: when enable_thinking != false, force T=1.0,
     top_k=0, top_p=1.0, min_p=0.0 (creativity for the reasoning pass and
     the trailing content)
  3. DSML structural override: when DsmlParser::IsInDsmlStructural()
     returns true (we are between tool-call markers but NOT in a param
     value payload), force T=0.0 so protocol bytes parse cleanly

When the effective temperature is 0, we keep using ds4_session_argmax +
MTP speculative path (matches ds4-server's gate that only enables MTP for
greedy positions). When > 0, we call ds4_session_sample(s, T, ...) with
a per-thread RNG seeded from system_clock and fall back to single-token
ds4_session_eval.

New public method on DsmlParser: IsInDsmlStructural() encodes which states
need protocol-byte determinism. PARAM_VALUE is excluded (payload uses user
sampling); TEXT and THINK are excluded (no tool-call context to protect).

Verified on the DGX Spark GB10: the e2e suite still passes with all 5
specs including tools, and the Predict output now varies between runs
(creative sampling active) while the tool-call args remain a clean
'{"location":"Paris, France"}' because the parser-state check forces
greedy on the structural bytes.

UX note: thinking mode is ON by default (matching ds4-server). Users who
want deterministic output should set Metadata.enable_thinking = false.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add sha256 to deepseek-v4-flash-q2 entry

Per HF LFS metadata for antirez/deepseek-v4-gguf:
  size: 86720111200 bytes (~80.76 GiB)
  sha256: 31598c67c8b8744d3bcebcd19aa62253c6dc43cef3b8adf9f593656c9e86fd8c

LocalAI's downloader verifies sha256 when present, so users who install
deepseek-v4-flash-q2 from the gallery get integrity-checked weights and
the partial-download issue (an 81 GB file is easy to truncate) becomes
recoverable instead of silently producing a broken backend.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-11 22:15:47 +02:00
dependabot[bot]
5d0f732b16 chore(deps): bump the go_modules group across 1 directory with 2 updates (#9759)
Bumps the go_modules group with 2 updates in the / directory: [github.com/gofiber/utils](https://github.com/gofiber/utils) and [github.com/go-git/go-git/v5](https://github.com/go-git/go-git).


Updates `github.com/gofiber/utils` from 1.1.0 to 1.2.0
- [Release notes](https://github.com/gofiber/utils/releases)
- [Commits](https://github.com/gofiber/utils/compare/v1.1.0...v1.2.0)

Updates `github.com/go-git/go-git/v5` from 5.18.0 to 5.19.0
- [Release notes](https://github.com/go-git/go-git/releases)
- [Changelog](https://github.com/go-git/go-git/blob/main/HISTORY.md)
- [Commits](https://github.com/go-git/go-git/compare/v5.18.0...v5.19.0)

---
updated-dependencies:
- dependency-name: github.com/gofiber/utils
  dependency-version: 1.2.0
  dependency-type: indirect
  dependency-group: go_modules
- dependency-name: github.com/go-git/go-git/v5
  dependency-version: 5.19.0
  dependency-type: indirect
  dependency-group: go_modules
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-11 18:37:00 +02:00
Ettore Di Giacinto
ea00199554 ci: tag every backend digest, including singletons
backend_build.yml pushes by canonical digest only (push-by-digest=true,
no tags applied at build time). User-facing tagging happens in
backend_merge.yml's `imagetools create` step. Before this commit,
scripts/changed-backends.js emitted a merge entry only for tag-suffixes
with 2+ legs, so every single-arch backend (CUDA/ROCm/Intel Python
images, vLLM, sglang, transformers, diffusers, ...) pushed its digest
untagged and stayed that way until quay's GC reaped it. Symptom: tag
releases shipped multi-arch backends tagged correctly, but no
v<X>-gpu-nvidia-cuda-12-vllm (or any singleton variant) ever appeared
in the registry.

Changes:

- scripts/changed-backends.js drops the `group.length < 2` skip and
  emits two merge matrices, one per arch class, so each downstream
  merge job can `needs:` only its corresponding build matrix.
- backend.yml splits backend-merge-jobs into multiarch and singlearch
  variants. The split preserves PR #9746's fix: slow singlearch CUDA
  builds (~6h) must not gate multiarch merges, or quay's GC reaps the
  multiarch per-arch digests before they're tagged.
- backend_pr.yml mirrors the split.
- backend_build.yml renames the digest artifact from
  `digests<suffix>-<platform-tag>` to
  `digests<suffix>--<platform-tag-or-"single">`. The `--` separator
  prevents the merge-side glob from over-matching sibling backends
  whose tag-suffix is a prefix of ours (e.g. -cpu-vllm vs
  -cpu-vllm-omni, -cpu-mlx vs -cpu-mlx-audio); the `single` placeholder
  keeps the name well-formed when platform-tag is empty.
- backend_merge.yml updates the download pattern to match.

Verified locally: a tag-push event now expands to 36 multiarch merge
entries (= 72 builds / 2 legs) and 199 singlearch merge entries (one
per singleton, including -gpu-nvidia-cuda-12-vllm at index 24).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-11 13:22:00 +00:00
LocalAI [bot]
b9e81dbfd4 chore: ⬆️ Update ggml-org/llama.cpp to 389ff61d77b5c71cec0cf92fe4e5d01ace80b797 (#9752)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-11 08:14:07 +02:00
Ettore Di Giacinto
059c493641 ci(darwin): brew reinstall ccache to handle transitive dep drift
Symptom (PR #9752, run 25638825961, job 75256261163):

  dyld[11144]: Library not loaded: /opt/homebrew/opt/fmt/lib/libfmt.12.dylib
    Referenced from: /opt/homebrew/Cellar/ccache/4.13.5/bin/ccache
  Abort trap: 6

Previous fix (commit 3f6e4934) added blake3, hiredis, xxhash, zstd as
explicit installs + cache paths because ccache's runtime dep closure
wasn't in the brew cache. But ccache 4.13 also depends on fmt — which
I missed. This is going to keep happening as upstream ccache adds or
shuffles deps over time.

Durable fix: `brew reinstall ccache` after the install step forces
brew to re-resolve and install ccache's full transitive dep closure
every run, immune to future formula changes. The brew downloads cache
makes the reinstall cheap (~5s on a cache hit).

Also adds fmt to the explicit install/link/Cellar-cache lists for the
fresh-runner path. The reinstall covers the cache-hit path; the
explicit install covers the brand-new-runner path where neither the
downloads cache nor the Cellar cache has been populated yet.

Caught by PR #9752's CI; would also have caught any future
LLAMA_VERSION bump triggering the Darwin matrix.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-10 21:17:30 +00:00
LocalAI [bot]
19d59102d5 feat(whisper-cpp): implement streaming transcription (#9751)
* test(whisper): wire e2e streaming transcription target

Adds test-extra-backend-whisper-transcription, mirroring the existing
llama-cpp / sherpa-onnx / vibevoice-cpp targets. The generic
AudioTranscriptionStream spec at tests/e2e-backends/backend_test.go:644
fails today because backend/go/whisper has no streaming impl - this
target is the failing TDD gate that the next phase makes pass.

Confirmed RED locally: 3 Passed (health, load, offline transcription),
1 Failed (streaming spec hits its 300s context deadline because the
base implementation returns 'unimplemented' but doesn't close the
result channel, leaving the gRPC stream open until the client times
out).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(whisper-cpp): expose new_segment_callback to the Go side

Adds set_new_segment_callback() and a C-side trampoline that whisper.cpp
invokes once per new text segment during whisper_full(). The trampoline
dispatches (idx_first, n_new, user_data) to a Go function pointer
registered via purego.NewCallback - text and timings are pulled by Go
through the existing get_segment_text/get_segment_t0/get_segment_t1
getters.

Wires the hook only when streaming is actually requested, to avoid a
per-segment function-pointer dispatch on the offline path.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(whisper-cpp): implement AudioTranscriptionStream

Wires whisper.cpp's new_segment_callback through purego back to Go so
the streaming transcription RPC produces real, time-correlated deltas
while whisper_full() is still decoding. Each segment becomes one
TranscriptStreamResponse{Delta}; whisper_full's return is the
TranscriptStreamResponse{FinalResult} carrying the full segment list,
language, and duration.

Per-call state is tracked in a sync.Map keyed by an atomic counter; the
Go callback registered via purego.NewCallback is a singleton, dispatched
through user_data. SingleThread today means only one entry is ever live,
but the map shape matches the sherpa-onnx TTS callback pattern.

The streaming path's final.Text is the literal concat of every emitted
delta (a strings.Builder accumulated by onNewSegment) so the e2e
invariant `final.Text == concat(deltas)` holds exactly. The first delta
has no leading space; subsequent deltas are space-prefixed. The offline
AudioTranscription path is unchanged.

Closes the gap with sherpa-onnx, vibevoice-cpp, llama-cpp, and tinygrad,
which already implement AudioTranscriptionStream.

Verified GREEN locally: make test-extra-backend-whisper-transcription
passes 4/4 specs (3 Passed initially under RED, +1 streaming spec now).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(whisper-cpp): assert progressive multi-segment streaming

Drives AudioTranscriptionStream against a real long-audio fixture and
asserts len(deltas) >= 2. The generic e2e spec at
tests/e2e-backends/backend_test.go:644 only checks len(deltas) >= 1
which is satisfied by both real and faked streaming - this spec is the
guardrail that a future "fake" impl can't sneak past.

Skipped by default (env-gated, like the cancellation spec); set
WHISPER_LIBRARY, WHISPER_MODEL_PATH, and WHISPER_AUDIO_PATH to a 30+
second clip to run.

Verified locally with a 55s 5x-JFK concat against ggml-base.en.bin:
1 Passed in 7.3s, deltas >= 2, finalSegmentCount >= 2,
concat(deltas) == final.Text.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(whisper-cpp): add transcription gRPC e2e job

Mirrors tests-sherpa-onnx-grpc-transcription /
tests-llama-cpp-grpc-transcription. Runs make
test-extra-backend-whisper-transcription whenever the whisper backend
or the run-all switch fires, so a pin-bump or refactor that breaks
streaming transcription gets caught before merge.

The whisper output on detect-changes is already emitted by
scripts/changed-backends.js (it iterates allBackendPaths); this PR
just exposes it as a workflow output and consumes it.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(whisper-cpp): silence errcheck on AudioTranscriptionStream defers

golangci-lint runs with new-from-merge-base=origin/master, so the
identical defer patterns in the existing offline AudioTranscription
path are grandfathered while the new ones in AudioTranscriptionStream
trip errcheck. Wrap both defers in `func() { _ = ... }()` to match what
errcheck wants without altering behavior. The errors from os.RemoveAll
and *os.File.Close are not actionable inside a defer here (we're
already returning), matching the offline path's contract.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-10 23:11:46 +02:00
LocalAI [bot]
4715a68660 chore: ⬆️ Update vllm-project/vllm cu130 wheel to 0.20.2 (#9750)
⬆️ Update vllm-project/vllm cu130 wheel

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-10 21:33:07 +02:00
LocalAI [bot]
28f33be48f chore: ⬆️ Update ggml-org/whisper.cpp to c33c5618b72bb345df029b730b36bc0e369845a3 (#9749)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-10 21:32:47 +02:00
LocalAI [bot]
a435f7cc69 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 23127139cb6fa314899c3b5f4935b88b3374c56c (#9748)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-10 21:32:28 +02:00
LocalAI [bot]
f6c9c20911 chore: ⬆️ Update ggml-org/llama.cpp to 2b2babd1243c67ca811c0a5852cedf92b1a20024 (#9747)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-10 21:17:38 +02:00
Ettore Di Giacinto
3f6e493439 ci(darwin): install ccache's runtime dylib deps (blake3, hiredis, xxhash, zstd)
Symptom (run 25634195866, job 75244019809): the Configure ccache step
on the Darwin llama-cpp build aborted with:

  dyld[5647]: Library not loaded: /opt/homebrew/opt/blake3/lib/libblake3.0.dylib
    Referenced from: /opt/homebrew/Cellar/ccache/4.13.5/bin/ccache
  Abort trap: 6

The previous Darwin fix (acc5588d) addressed missing /opt/homebrew/bin
symlinks after a brew cache restore by force-linking. This is a
different layer: ccache's Cellar dir IS restored from cache and IS
linked, but ccache 4.13 dynamically links against blake3 / hiredis /
xxhash / zstd at runtime, and those dependencies are NOT in the
restored Cellar paths. brew install ccache sees the ccache Cellar
present and skips the install — including skipping installation of
those transitive deps.

Two-part fix:
  - Add /opt/homebrew/Cellar/{blake3,hiredis,xxhash,zstd} to the brew
    cache restore/save paths so future cache-hit runs restore them.
  - Explicitly install + link them in the Dependencies step so even
    a fresh runner (cache miss on a new key) gets them, and brew has
    them on hand for ccache to dlopen.

Caught by run 25634195866. Pre-existing condition on Darwin runners;
surfaced because Darwin builds run more often after the llama-cpp-
darwin consolidation in #9731.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-10 17:09:01 +00:00
LocalAI [bot]
35f6db8c76 ci: split backend-jobs into single-arch and multi-arch matrices (#9746)
Symptom (run 25612992409): backend-merge-jobs failed with
"quay.io/go-skynet/local-ai-backends@sha256:fdbd93ca...: not found"
even though the per-arch build for -cpu-llama-cpp pushed that exact
digest 14h31m earlier.

Root cause: backend-merge-jobs was gated on the WHOLE backend-jobs
matrix (`needs: backend-jobs`). The multi-arch -cpu-llama-cpp legs
finished within 30 min, but a single-arch CUDA-12-llama-cpp slot in
the same matrix queued for ~8h (max-parallel: 8 throttle) and then
took ~6h to build cold. By the time it freed the merge to run, quay's
GC had reaped the per-arch digests pushed by the fast multi-arch legs
the day before.

Fix: split the linux backend matrix in two.

  backend-jobs-multiarch  - entries with `platform-tag` set (paired
    per-arch legs that feed backend-merge-jobs).
  backend-jobs-singlearch - entries without `platform-tag` (heavy
    standalone builds: CUDA, ROCm, Intel oneAPI, vLLM, sglang, etc.).

backend-merge-jobs now `needs:` only backend-jobs-multiarch. The
multi-arch matrix completes in ~2-3h, well inside quay's GC window.
Heavy single-arch entries keep running independently with no merge
dependency.

scripts/changed-backends.js gains a splitByArch() helper that
partitions filtered entries by whether `platform-tag` is set, and
emits matrix-singlearch + matrix-multiarch + has-backends-singlearch
+ has-backends-multiarch outputs (replacing the previous combined
matrix / has-backends pair). Applied in both the full-matrix and
filtered-matrix code paths. Smoke test: 199 single-arch + 72 multi-
arch + 35 darwin = 271 total entries; 36 merge-matrix entries
(one per multi-arch backend pair). Matches expectation.

Local `make backends/<name>` is unaffected — the script's outputs
only feed CI workflow matrices.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-10 18:15:53 +02:00
Ettore Di Giacinto
6113e5a4d0 docs(ci-caching): list all paths that retrigger base-images.yml
Now that base-images.yml's master-push trigger includes the install
script and apt-mirror script (commit 7fff8584), enumerate them in
the workflow-surfaces table instead of the vaguer "base Dockerfile/
workflow" phrasing.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 22:31:37 +00:00
Ettore Di Giacinto
7fff858408 ci(base-images): also trigger rebuild on .docker/install-base-deps.sh changes
base-images.yml's master-push trigger had a path filter listing only
backend/Dockerfile.base-grpc-builder and .github/workflows/base-images.yml.
That misses .docker/install-base-deps.sh — which is the actual source
of truth for what goes into each base image (apt deps, gRPC, conditional
CUDA/ROCm/Vulkan installs). The script is bind-mounted into the base
Dockerfile at build time; changes to it would change the produced
images, but without this path filter, the workflow wouldn't auto-rebuild
on those changes. Stale bases would persist until Saturday's cron or a
manual workflow_dispatch.

Same applies to .docker/apt-mirror.sh, also bind-mounted by the base
Dockerfile.

Add both to the trigger paths so consumer-affecting changes to either
file rebuild the bases automatically.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 22:30:46 +00:00
LocalAI [bot]
6fd21d5cf3 docs(agents): update CI caching docs after the GHA-free-tier migration (#9742)
The migration shipped over a sequence of PRs (#9726#9727#9730#9731#9737#9738 plus a handful of direct-to-master fixes) and
left the .agents/ docs significantly out of date.

Updated:

- .agents/ci-caching.md (significant rewrite)
  - Cache key shape: now includes per-arch suffix (cache<suffix>-<arch>).
  - New "Workflow surfaces" overview table.
  - New "Pre-built base images (base-grpc-*)" section covering the 10
    quay.io/go-skynet/ci-cache:base-grpc-* tags, the multi-target
    Dockerfile pattern (builder-fromsource / builder-prebuilt /
    aliasing FROM), the BUILDER_BASE_IMAGE → BUILDER_TARGET derivation,
    the bootstrap-on-branch order for new variants.
  - New "Per-arch native builds + manifest merge" section: split
    matrix entries, push-by-digest, backend_merge.yml, why
    provenance: false matters.
  - New "Path filter on master push" section: changed-backends.js
    handles push events via the Compare API; weekly Sunday cron is
    the safety net for unpinned Python deps.
  - New "ccache for C++ backend builds" section.
  - New "Composite actions" section: free-disk-space and
    setup-build-disk.
  - New "Concurrency" section documenting the per-PR-per-commit group
    fix.
  - Darwin section gains the brew link --overwrite note (after-
    cache-restore symlinks weren't restored) and the llama-cpp-darwin
    consolidation context.
  - "Self-hosted runners" section confirming the matrix is free of
    arc-runner-set / bigger-runner references except the residual
    test-extra.yml vibevoice case.
  - "Touching the cache pipeline" rule list extended (provenance,
    install-base-deps.sh single-source-of-truth, base-images bootstrap
    order).

- .agents/adding-backends.md
  - Section 2 title: backend.yml -> backend-matrix.yml (path moved).
  - New paragraph on per-arch entries (platform-tag + paired matrix
    rows + auto-firing merge job).
  - New paragraph on builder-base-image for llama-cpp / ik-llama-cpp /
    turboquant.
  - Final checklist line updated accordingly.

- .agents/building-and-testing.md
  - Reference: backend.yml -> backend-matrix.yml.
  - Note about builder-base-image and BUILDER_TARGET defaulting to
    builder-fromsource for local builds.

- AGENTS.md
  - One-line description update for ci-caching.md to mention the new
    infrastructure (per-arch keys, base-grpc-*, manifest-merge,
    setup-build-disk, path filter).

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-10 00:28:57 +02:00
LocalAI [bot]
6cbf69dc29 chore: ⬆️ Update ggml-org/llama.cpp to 1e5ad35d560b90a8ac447d149c8f8447ae1fcaa0 (#9739)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-10 00:06:29 +02:00
LocalAI [bot]
593f3a8648 ci: refactor llama-cpp variant Dockerfiles to consume prebuilt base-grpc images (PR 2/2) (#9738)
* ci(backend_build): plumb builder-base-image and BUILDER_TARGET build-args

Adds an optional builder-base-image input. When set, BUILDER_BASE_IMAGE
is forwarded as a build-arg AND BUILDER_TARGET=builder-prebuilt is set
to select the variant Dockerfile's prebuilt-base stage. When empty,
BUILDER_TARGET=builder-fromsource (the default) keeps the existing
from-source build path.

This makes the prebuilt-base optimization opt-in per matrix entry
without breaking local `make backends/<name>` invocations or backends
whose Dockerfile doesn't have a prebuilt path.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(llama-cpp,ik-llama-cpp,turboquant): multi-target Dockerfiles for prebuilt + from-source

Restructure the three llama.cpp-derived Dockerfiles so each supports
two builder paths in a single file, selected via the BUILDER_TARGET
build-arg:

  BUILDER_TARGET=builder-fromsource (default)
    - Standalone build: gRPC stage + apt installs + (conditionally)
      CUDA/ROCm/Vulkan + compile.
    - Used by `make backends/llama-cpp` locally and any caller that
      doesn't supply a prebuilt base.

  BUILDER_TARGET=builder-prebuilt
    - FROM \${BUILDER_BASE_IMAGE} (one of quay.io/go-skynet/ci-cache:
      base-grpc-* shipped in PR #9737).
    - Skips ~25-35 min of gRPC compile + ~5-10 min of toolchain installs.
    - Used by CI when the matrix entry sets builder-base-image.

Final FROM scratch resolves BUILDER_TARGET via an aliasing FROM stage
(BuildKit doesn't support variable expansion directly in COPY --from),
then COPY --from=builder pulls package output from the chosen path.
BuildKit prunes the unreferenced builder, so each build only does the
work for the chosen path.

The compile RUN is identical between both builder stages, so it's
factored into .docker/<name>-compile.sh and bind-mounted into both.
ccache mount + cache-id stay per-arch / per-build-type.

Local DX preserved: `make backends/llama-cpp` (no extra args) defaults
to BUILDER_TARGET=builder-fromsource and works exactly as before.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(backend.yml,backend_pr.yml): forward builder-base-image from matrix

Plumbs the new optional builder-base-image input from matrix into
backend_build.yml. backend_build.yml derives BUILDER_TARGET from
whether builder-base-image is set, so matrix entries that map to a
prebuilt base get the prebuilt path; entries that don't (python/go/
rust backends) fall through to the default builder-fromsource (which
their own Dockerfiles don't reference, so it's a no-op for them).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(backend-matrix): wire builder-base-image to llama-cpp variants

For every entry whose Dockerfile is llama-cpp/ik-llama-cpp/turboquant,
add a builder-base-image field pointing at the appropriate prebuilt
quay.io/go-skynet/ci-cache:base-grpc-* tag.

backend_build.yml derives BUILDER_TARGET from this field's presence:
non-empty -> builder-prebuilt; empty -> builder-fromsource. So this
commit alone activates the prebuilt-base path for these 23 backends
in CI, while local `make backends/<name>` (no extra args) keeps the
from-source path.

Mapping by (build-type, arch):
- '' / amd64        -> base-grpc-amd64
- '' / arm64        -> base-grpc-arm64
- cublas-12 / amd64 -> base-grpc-cuda-12-amd64
- cublas-13 / amd64 -> base-grpc-cuda-13-amd64
- cublas-13 / arm64 -> base-grpc-cuda-13-arm64
- hipblas / amd64   -> base-grpc-rocm-amd64
- vulkan / amd64    -> base-grpc-vulkan-amd64
- vulkan / arm64    -> base-grpc-vulkan-arm64
- sycl_* / amd64    -> base-grpc-intel-amd64
- cublas-12 + JetPack r36.4.0 / arm64 -> base-grpc-l4t-cuda-12-arm64

Cold-build savings expected: ~25-35 min per variant (skips the gRPC
compile + toolchain install that's now in the base).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: add base-grpc-l4t-cuda-12-arm64 variant for legacy JetPack entries

Two matrix entries (-nvidia-l4t-arm64-llama-cpp, -nvidia-l4t-arm64-
turboquant) build against nvcr.io/nvidia/l4t-jetpack:r36.4.0 + CUDA
12 ARM64. They're distinct from -nvidia-l4t-cuda-13-arm64-* which use
Ubuntu 24.04 + CUDA 13 sbsa. Add the missing JetPack-based variant
to base-images.yml so those two entries' builder-base-image mapping
in the previous commit resolves.

Bootstrap order before merging this PR (re-run base-images.yml on
this branch — 9 existing variants hit BuildKit cache, only the new
l4t-cuda-12-arm64 builds cold):

  gh workflow run base-images.yml --ref ci/base-images-consumers

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: extract base-builder install logic into .docker/install-base-deps.sh

Pre-extraction, the apt + protoc + cmake + conditional CUDA/ROCm/Vulkan
+ gRPC install logic was duplicated across four files:
  - backend/Dockerfile.base-grpc-builder (CI prebuilt-base source of truth)
  - backend/Dockerfile.llama-cpp (builder-fromsource stage)
  - backend/Dockerfile.ik-llama-cpp (builder-fromsource stage)
  - backend/Dockerfile.turboquant (builder-fromsource stage)

A bump to e.g. CUDA toolkit packages had to be made in 4 places, and
drift between the prebuilt base and the variant-Dockerfile from-source
path was a real concern (ik-llama-cpp's hipblas branch was already
missing the rocBLAS Kernels echo that llama-cpp / turboquant /
base-grpc-builder all had).

Factor the install logic into a single .docker/install-base-deps.sh
that reads its inputs from env vars and runs conditionally on
BUILD_TYPE / CUDA_*_VERSION / TARGETARCH. Each Dockerfile now bind-
mounts the script alongside .docker/apt-mirror.sh and invokes it from
a single RUN step.

The variant Dockerfiles' grpc-source stage is removed entirely — the
script handles gRPC compile + install at /opt/grpc, and the
builder-fromsource stage mirrors builder-prebuilt by copying
/opt/grpc/. to /usr/local/.

Result:
  - install-base-deps.sh: 244 lines (one source of truth)
  - Dockerfile.base-grpc-builder: 268 -> 98 lines
  - Dockerfile.llama-cpp: 361 -> 157 lines
  - Dockerfile.ik-llama-cpp: 348 -> 151 lines
  - Dockerfile.turboquant: 355 -> 154 lines
  - Total Dockerfile bytes: 1332 -> 560 lines (58% reduction)

Bit-equivalence between prebuilt and from-source paths is now enforced
by construction: both invoke the same script with the same inputs.
A side-effect is that ik-llama-cpp now also gets the rocBLAS Kernels
echo + clblas block parity it was previously missing.

Includes the BUILD_TYPE=clblas branch (libclblast-dev) for parity even
though no current CI matrix entry uses it.

After this commit's force-push, base-images.yml needs to be redispatched
on this branch — the Dockerfile.base-grpc-builder content shifts so the
existing cache won't apply for the install layer (gRPC layer also
rebuilds since it's now in the same RUN step).

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(base-images): skip-drivers on JetPack l4t variant

cuda-nvcc-12-0 isn't installable via apt on the JetPack r36.4.0 base
image — JetPack ships CUDA preinstalled at /usr/local/cuda and its
apt feed doesn't carry the cuda-nvcc-* packages from the public
repositories. The original matrix entry for -nvidia-l4t-arm64-llama-cpp
on master sets skip-drivers: 'true' for exactly this reason; the
new base-grpc-l4t-cuda-12-arm64 base needs to match.

Also forwards SKIP_DRIVERS as a build-arg from matrix into the build
(was missing entirely before this commit).

Caught by run 25612030775 — l4t-cuda-12-arm64 failed at:
  E: Package 'cuda-nvcc-12-0' has no installation candidate

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-10 00:03:52 +02:00
Ettore Di Giacinto
acc5588d2c ci(darwin): force-link brew formulas after cache restore
Symptom: `ccache: command not found` in the Configure ccache step on
runs that hit the brew cache.

Root cause: actions/cache restores /opt/homebrew/Cellar/<formula> but
NOT the bin symlinks at /opt/homebrew/bin/*. The subsequent
`brew install` sees the Cellar entries present and decides "already
installed" — without re-running the link step. So on cache-hit runs
none of the cached formulas are actually on PATH.

Fix: explicit `brew link --overwrite` for every formula we install,
right after `brew install`. --overwrite tolerates leftover symlinks
from a partial earlier install. The 2>/dev/null + || true keeps the
step from failing if a formula is already correctly linked.

Pre-existing flake; surfaces more often as Darwin matrix coverage
grows after the llama-cpp-darwin consolidation in #9731.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 20:41:03 +00:00
LocalAI [bot]
28e29625a2 ci: add pre-built base-grpc-builder image infrastructure (PR 1/2) (#9737)
Introduces a parameterized Dockerfile.base-grpc-builder that produces
a fully-prepped builder base image (apt deps + protoc + cmake + gRPC
at /opt/grpc + conditional CUDA/ROCm/Vulkan toolchains) and a
base-images.yml workflow that builds + pushes 9 variants to
quay.io/go-skynet/ci-cache:base-grpc-*:

  base-grpc-amd64                 (Ubuntu 24.04, CPU-only)
  base-grpc-arm64                 (Ubuntu 24.04, CPU-only)
  base-grpc-cuda-12-amd64         (Ubuntu 24.04 + CUDA 12.8)
  base-grpc-cuda-13-amd64         (Ubuntu 22.04 + CUDA 13.0)
  base-grpc-cuda-13-arm64         (Ubuntu 24.04 + CUDA 13.0 sbsa)
  base-grpc-rocm-amd64            (rocm/dev-ubuntu-24.04:7.2.1 + hipblas)
  base-grpc-vulkan-amd64          (Ubuntu 24.04 + Vulkan SDK 1.4.335)
  base-grpc-vulkan-arm64          (Ubuntu 24.04 + Vulkan SDK ARM 1.4.335)
  base-grpc-intel-amd64           (intel/oneapi-basekit:2025.3.2)

The variant Dockerfiles (Dockerfile.llama-cpp, ik-llama-cpp, turboquant)
are NOT touched in this PR. PR 2 will refactor them to FROM these
prebuilt bases. This PR is intentionally inert - landing it changes no
existing CI behavior. The base images don't exist on quay until
someone manually triggers the workflow.

Bootstrap after merge:
  gh workflow run base-images.yml --ref master
Wait ~30 min for all 9 variants to push, then merge PR 2 (the
consumer-side refactor that uses BUILDER_BASE_IMAGE build-arg to
FROM these tags).

Triggers afterwards:
  - Saturdays 05:00 UTC (cron) - picks up upstream security updates,
    runs ~24h before the backend.yml Sunday cron so bases are fresh.
  - workflow_dispatch - manual ad-hoc rebuild.
  - master push touching Dockerfile.base-grpc-builder or this workflow.

Why split into two PRs: the variant Dockerfiles in PR 2 will FROM the
prebuilt bases and have no from-source fallback. Their CI builds fail
if the bases don't exist on quay yet. Landing infrastructure first +
manual bootstrap + then consumer refactor avoids a broken-master window.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 18:44:42 +02:00
Ettore Di Giacinto
31aa0582a5 ci(ik-llama-cpp,turboquant): add BuildKit ccache mount to compile steps
Mirror the ccache mount added to Dockerfile.llama-cpp in 9228e5b4 for
the other two llama.cpp-derived backends. Same shape, distinct mount
ids so each backend's cache is independent:

  ik-llama-cpp-ccache-${TARGETARCH}-${BUILD_TYPE}
  turboquant-ccache-${TARGETARCH}-${BUILD_TYPE}

ik_llama.cpp is a different upstream fork; no source overlap with
llama-cpp, separate cache makes sense.

turboquant is a llama.cpp fork that reuses backend/cpp/llama-cpp
source via a thin wrapper Makefile — most TUs would in principle hit
llama-cpp's ccache too. Keeping them separate for now to avoid one
fork's regressions poisoning the other; revisit sharing after we have
hit-rate numbers.

Same registry-export behavior as llama-cpp: the cache mount rides on
backend_build.yml's existing cache-to: type=registry,mode=max.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 16:21:49 +00:00
Ettore Di Giacinto
3568b2819d fix(gallery): keep auto-upgrade off non-dev backends when -development is installed (#9736)
A `-development` backend variant (e.g. `cuda12-llama-cpp-development`)
shares its `alias` with the stable counterpart and is meant to be a
drop-in replacement via ListSystemBackends alias resolution. Two paths
in the auto-upgrade flow let the stable variant slip back in on top of
the user's explicit dev pick:

1. ListSystemBackends emits a synthetic alias row keyed by the alias
   name that re-uses the chosen concrete's metadata pointer. In
   distributed mode, the worker's handleBackendList serialised that
   row over NATS as `{Name: <alias>, URI: <dev URI>, Digest: <dev>}`
   — the frontend can't reconstruct the alias relationship, and the
   wire-rebuilt row then carried `Metadata.Name = <alias>` and
   resolved against an unrelated gallery entry on the next upgrade
   check.
2. CheckUpgradesAgainst happily iterated the synthetic row in
   single-node too. Today the duplicate gallery lookup is harmless
   because both rows share the same `Metadata.Name`, but any gallery
   change that gives a meta backend a version, or any concrete
   sharing its alias with a dev counterpart, would surface a phantom
   non-dev upgrade and auto-upgrade would install it — shadowing the
   dev one through alias-token preference.

Two layered fixes:

- `core/services/worker/lifecycle.go` (`handleBackendList`): drop
  rows where the map key differs from `b.Metadata.Name`. Concrete
  and meta entries always have `key == Metadata.Name`; only synthetic
  aliases violate it. Workers now report only what's actually on disk;
  the per-node UI listing and CheckUpgrades both stop seeing phantoms.
- `core/gallery/upgrade.go` (`CheckUpgradesAgainst`): iterate by key,
  skip rows where `key != Metadata.Name` (belt-and-suspenders for any
  caller-supplied installed set), and apply the dev-aware rule —
  build a set of installed `Metadata.Name`s and drop any non-dev
  candidate `X` whose `X-<devSuffix>` counterpart is installed. Uses
  the configured dev suffix from `getFallbackTagValues(systemState)`.

Manual `POST /api/backends/upgrade/<name>` is unaffected: it goes
straight through `bm.UpgradeBackend(name)` without consulting the
suppression list, so users who genuinely want the stable variant
upgraded can still trigger it explicitly.

Tests in core/gallery/upgrade_test.go cover three cases under
"CheckUpgradesAgainst (distributed)": dev-only installed → only the
dev surfaces; both variants installed → dev still wins; synthetic
alias row is ignored. Generic backend names are used to avoid the
capability filter dropping cuda-prefixed entries on a CPU-only host.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 18:20:00 +02:00
Ettore Di Giacinto
9228e5b412 ci(llama-cpp): add BuildKit ccache mount to the compile step
The big RUN at line 268 of Dockerfile.llama-cpp re-runs from scratch on
every LLAMA_VERSION bump (or any LocalAI source change due to
COPY . /LocalAI just before). For CUDA-13 specifically that compile
recently hit the GHA 6h hard limit and failed:

  https://github.com/mudler/LocalAI/actions/runs/25598418931/job/75148244557

Add a BuildKit cache mount on /root/.ccache and thread ccache through
CMake (CMAKE_C/CXX/CUDA_COMPILER_LAUNCHER) so most translation units
hit cache when their preprocessed source is byte-identical to the
previous build.

The cache mount is exported to the registry as part of the existing
cache-to: type=registry,mode=max in backend_build.yml, so it persists
across runs. mount id is keyed on TARGETARCH + BUILD_TYPE so different
variants don't thrash the same cache slot; sharing=locked serializes
concurrent writes.

Cold-build effect (first run after enable, or on LLAMA_VERSION bump
that touches every TU): unchanged. Hot-build effect (subsequent runs
with the same source, or LLAMA_VERSION bumps that touch a handful of
files): ~5-15 min for the llama.cpp compile vs the previous 1-3h cold.
For CUDA-13 specifically this should bring rebuilds well under the 6h
GHA limit.

Does NOT help the *first* post-bump build — that's still cold. For
that, follow-up work would be: (a) trim CUDA_DOCKER_ARCH to modern
GPUs only, (b) audit which CMake variants the published images
actually need, (c) pre-built CUDA+gRPC base image.

ccache package is already installed in the builder stage (line 90).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 16:16:46 +00:00
LocalAI [bot]
a91e718473 chore: ⬆️ Update ggml-org/llama.cpp to 00d56b11c3477b99bc18562dc1d1834f0d961778 (#9733)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-09 12:05:11 +02:00
Ettore Di Giacinto
6d2b7d893a ci: drop paths-ignore from test.yml and tests-e2e.yml
These workflows are configured as required status checks in branch
protection. With paths-ignore matching the PR diff, the workflow
doesn't trigger and no status is reported — branch protection then
blocks the PR with "Expected — Waiting for status to be reported"
indefinitely. Especially common for backend-only PRs since the ignore
list included backend/**.

Run the full test suite on every PR. Cost is ~5 min per PR for
tests-linux + ~similar for tests-apple + the e2e backend smoke; small
trade for unblocking PR merges.

Workflows affected:
- tests-linux (1.26.x), tests-apple (1.26.x) in test.yml
- tests-e2e-backend (1.25.x) in tests-e2e.yml

Other workflows that still have paths-ignore (none currently in the
required-checks list) are left as-is — adding them to required later
would re-introduce the same problem.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 09:23:51 +00:00
LocalAI [bot]
d1eef05852 chore: ⬆️ Update ikawrakow/ik_llama.cpp to ab0f22b819ac57b7e7484f69c00c10fc755d5c6c (#9734)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-09 11:18:59 +02:00
Ettore Di Giacinto
5a12392570 ci(concurrency): make cancel-in-progress event-aware, group by sha on push
Yesterday two PRs (#9724 llama.cpp bump, #9731 llama-cpp-darwin
consolidation) merged 11 seconds apart. Both shared the same
backend.yml concurrency group (ci-backends-refs/heads/master-...) due
to "${{ github.head_ref || github.ref }}" — empty head_ref on push
events falls through to the static refs/heads/master. With
cancel-in-progress: true that meant the second merge cancelled the
first's in-flight backend builds. The first PR's CI never finished;
the second PR only touched CI files so its run was a no-op.

Two changes per workflow:
- group: replace "${{ github.head_ref || github.ref }}" with
  "${{ github.event.pull_request.number || github.sha }}". On PRs
  this groups by PR number (same as before, just keyed on number not
  branch name); on push events it groups per-commit, so two master
  pushes never share a group.
- cancel-in-progress: gate on github.event_name == 'pull_request' so
  rapid pushes to a PR still cancel old runs (newer push wins) but
  master pushes never cancel each other.

Trade-off vs alternatives:
- Merge queue would also solve this and additionally test the merged
  commit before it lands. Heavier process change; out of scope here.
- Allowing per-commit master concurrency means two simultaneous master
  runs may overlap and race on tag pushes, but each commit's manifest
  digest is unique and the registry is last-writer-wins on tags —
  newer commit's tag overwrites older.

Applied to 11 workflows that share the same concurrency pattern:
backend.yml, backend_pr.yml, image.yml, image-pr.yml, lint.yml,
test.yml, test-extra.yml, tests-e2e.yml, tests-aio.yml,
tests-ui-e2e.yml, generate_intel_image.yaml.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 08:30:55 +00:00
Ettore Di Giacinto
05d6383393 Change vibevoice.cpp repository reference
Updated repository reference for vibevoice.cpp in bump_deps.yaml.

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-09 10:30:11 +02:00
Ettore Di Giacinto
733c254b32 ci: consolidate llama-cpp-darwin into the matrix-driven Darwin flow (#9731)
The bespoke llama-cpp-darwin + llama-cpp-darwin-publish top-level jobs
in backend.yml ran unconditionally on every backend.yml trigger
(push/cron), bypassing the path filter that all 34 other Darwin
backends already honor via backend-jobs-darwin -> backend_build_darwin.yml.

Move llama-cpp into the includeDarwin matrix:
- New entry in .github/backend-matrix.yml (lang=go, no build-type).
- backend_build_darwin.yml gains an `if: inputs.backend == 'llama-cpp'`
  build step that drives `make backends/llama-cpp-darwin`. The bespoke
  script (scripts/build/llama-cpp-darwin.sh) compiles three CMake
  variants from backend/cpp/llama-cpp and bundles dylibs via otool, so
  it doesn't fit the build-darwin-go-backend mold; the existing
  llama-cpp-aware ccache setup blocks already in this workflow are
  what motivated the consolidation in the first place.
- scripts/changed-backends.js's inferBackendPathDarwin gains a special
  case so llama-cpp on Darwin maps to backend/cpp/llama-cpp/ (the C++
  source tree) rather than the non-existent backend/go/llama-cpp/.
- Bumps Darwin go-version from 1.24.x -> 1.25.x in backend.yml and
  backend_pr.yml so llama-cpp keeps the Go toolchain it had under the
  bespoke job; the other 34 Darwin backends pick this up too with no
  known reason to pin 1.24.
- Removes ~80 lines of bespoke YAML from backend.yml.

The publish path is unchanged in shape - every Darwin backend now uses
the same crane-push leg from ubuntu-latest in
backend_build_darwin.yml; only the build target differs per backend.

After this commit, llama-cpp-darwin only rebuilds when
backend/cpp/llama-cpp/ is touched (verified locally) - same behavior
as every other Darwin backend.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 10:18:17 +02:00
LocalAI [bot]
4542833cb4 chore: ⬆️ Update ggml-org/llama.cpp to 9f5f0e689c9e977e5f23a27e344aa36082f44738 (#9724)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-09 10:18:05 +02:00
LocalAI [bot]
f0374aa0e8 ci: finish GHA free-tier migration (per-arch fan-out, image splits, retire self-hosted, fix provenance) (#9730)
* ci: add per-arch + manifest-merge support for LocalAI server image

Mirror the backend_build.yml + backend_merge.yml pattern shipped in
PR #9726 for the LocalAI server image:

- image_build.yml accepts optional platform-tag (default ''), scopes
  registry cache to cache-localai<suffix>-<platform-tag>, and pushes
  by canonical digest only on push events. Digests upload as artifacts
  named digests-localai<suffix>-<platform-tag>, with a "-core"
  placeholder when tag-suffix is empty so the merge job's download
  pattern doesn't over-match across multiple suffixes.
- image_merge.yml is a new reusable workflow that downloads matching
  digest artifacts and assembles the final tagged manifest list via
  docker buildx imagetools create.

Image names differ from backend_*.yml: the LocalAI server is published
under quay.io/go-skynet/local-ai and localai/localai (not -backends).

Not yet wired into image.yml / image-pr.yml — Commit C does that.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: fan out per-arch split to remaining 34 backends

Convert all remaining linux/amd64,linux/arm64 entries in
backend-matrix.yml to per-arch + manifest-merge form. Each was a
single matrix entry running both arches on x86 under QEMU emulation;
each becomes two entries — amd64 on ubuntu-latest, arm64 on
ubuntu-24.04-arm (native).

Four backends that were on bigger-runner (-cpu-llama-cpp,
-cpu-turboquant, -gpu-vulkan-llama-cpp, -gpu-vulkan-turboquant) have
both legs moved to free tier as part of the same change. They are
compile-only (no torch/CUDA install) and fit comfortably with the
setup-build-disk /mnt relocation. Phase 4 (next commit) retires the
remaining 5 single-arch bigger-runner entries.

After this commit:
- 271 total matrix entries (was 237)
- 0 multi-arch entries left
- 36 per-arch pairs (34 new + 2 pilots from PR #9727)
- 5 bigger-runner entries remaining (single-arch, Phase 4 target)

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: split LocalAI image multi-arch entries per arch + merge

Mirror the backend per-arch split for the main LocalAI image:

- image.yml's core-image-build matrix: split the core ('') and
  -gpu-vulkan entries into amd64 + arm64 legs each. amd64 on
  ubuntu-latest, arm64 on ubuntu-24.04-arm (native).
- New top-level core-image-merge and gpu-vulkan-image-merge jobs
  call image_merge.yml after core-image-build completes.
- image-pr.yml's image-build matrix: split the -vulkan-core entry.
  No merge job added on the PR side — image_build.yml's digest-push
  is push-only-event-gated, so a PR-side merge would have nothing
  to download.

After this commit, no workflow file references
linux/amd64,linux/arm64 in a single matrix slot.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: retire bigger-runner from backend matrix (Phase 4)

Migrate the remaining 5 single-arch bigger-runner entries to
ubuntu-latest. Combined with the Phase 3 setup-build-disk /mnt
relocation (PR #9726), free-tier ubuntu-latest now has ~100 GB of
working space — enough for ROCm dev image (~16 GB), CUDA toolkit
(~5 GB), and the per-backend compile/install steps these entries do.

Backends migrated:
- -gpu-nvidia-cuda-12-llama-cpp
- -gpu-nvidia-cuda-12-turboquant
- -gpu-rocm-hipblas-faster-whisper
- -gpu-rocm-hipblas-coqui
- -cpu-ik-llama-cpp

After this commit, .github/backend-matrix.yml has zero bigger-runner
references. The bigger-runner used in tests-vibevoice-cpp-grpc-
transcription (test-extra.yml) is a separate concern handled in a
follow-up.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: migrate 9 Intel oneAPI backends to free tier (Phase 5.1)

Intel oneAPI base image is ~6 GB; each backend's wheel install
stays well within the ~100 GB working space provided by Phase 3's
setup-build-disk /mnt relocation. Lowest-risk batch of the
arc-runner-set retirement.

Backends migrated:
  vllm, sglang, vibevoice, qwen-asr, nemo, qwen-tts,
  fish-speech, voxcpm, pocket-tts (all -gpu-intel-* variants).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: migrate 15 ROCm Python backends to free tier (Phase 5.2)

ROCm dev image (~16 GB) plus per-backend torch/wheels install fits
on ubuntu-latest with the /mnt-relocated Docker root. These entries
include the heavier vLLM/sglang/transformers/diffusers stack on
ROCm; if any specific backend OOMs or runs out of disk, individual
flips back to arc-runner-set are revertable per-entry.

Backends migrated: all 15 -gpu-rocm-hipblas-* entries previously on
arc-runner-set (vllm/vllm-omni/sglang/transformers/diffusers/
ace-step/kokoro/vibevoice/qwen-asr/nemo/qwen-tts/fish-speech/
voxcpm/pocket-tts/neutts).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: migrate 6 CUDA Python backends to free tier (Phase 5.3)

vLLM/sglang stacks on CUDA 12 and CUDA 13 are the heaviest
backends in the matrix — flash-attn intermediate layers can spike
disk usage during build. setup-build-disk's /mnt relocation gives
~100 GB working space which fits the documented peak.

Highest-risk batch of the arc-runner-set retirement; if any
backend fails to build on free tier, the per-entry runs-on flip
is the unit of revert.

Backends migrated: -gpu-nvidia-cuda-{12,13}-{vllm,vllm-omni,sglang}.

After this commit, .github/backend-matrix.yml has zero references
to arc-runner-set or bigger-runner. The migration is complete.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: disable provenance on multi-registry digest pushes

Root-caused on master via PR #9727's pilot: when docker/build-push-action@v7
pushes a single build to TWO registries simultaneously with
push-by-digest=true, buildx generates a per-registry provenance
attestation manifest (because mode=max — the default for push:true —
includes the runner ID). That makes the resulting manifest-list digest
diverge across registries:

  arm64 -cpu-faster-whisper build:
    image manifest:        sha256:d3bdd34b... (identical, content-only)
    quay manifest list:    sha256:66b4cfc8... (with quay attestation)
    dockerhub manifest list: sha256:e0733c3b... (with dockerhub attestation)

steps.build.outputs.digest returns only one of the list digests
(empirically the dockerhub one). The merge job then asks
"quay.io/...@sha256:e0733c3b..." which doesn't exist on quay — that
list has digest 66b4cfc8 there. Result: imagetools create fails with
"not found" and the merge job fails (run 25581983094, job 75110021491).

Setting provenance: false drops the per-registry attestation; the
manifest-list digest becomes pure content, identical across both
registries, and steps.build.outputs.digest works on either lookup.

Applied to backend_build.yml and image_build.yml — both refactored
to use the same multi-registry digest-push pattern in the prior PRs.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 09:37:00 +02:00
Ettore Di Giacinto
fe7b27eb66 test(ci): trigger faster-whisper rebuild to observe per-arch+merge
The PR that introduced the per-arch + manifest-merge pilot (#9727)
only touched CI infrastructure files, so the path filter correctly
skipped backend builds on its merge commit. To observe the new
backend-merge-jobs flow assemble a real manifest list, this commit
touches faster-whisper's Makefile so its two new per-arch entries
schedule and the merge job runs.

The trailing comment is the smallest possible diff and is harmless
to the build.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-08 22:09:46 +00:00
LocalAI [bot]
14a3275329 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 98950267c67fd95937a54ebd6e3c66cf2679b710 (#9725)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-09 00:06:05 +02:00
LocalAI [bot]
cb68cd1cf4 ci: pilot per-arch split + manifest merge for faster-whisper and llama-cpp-quantization (#9727)
ci: pilot per-arch split for faster-whisper and llama-cpp-quantization

Convert two backends from QEMU-emulated multi-arch (linux/amd64,linux/arm64
on a single ubuntu-latest) to native per-arch + manifest-list merge:
- amd64 leg on ubuntu-latest
- arm64 leg on ubuntu-24.04-arm (native, ~5-10x faster than emulated)
- merge job assembles both digests under the final tag via
  docker buildx imagetools create

Backends piloted:
- -cpu-faster-whisper (small Python, fast baseline)
- -cpu-llama-cpp-quantization (heavier compile path, stress test)

Infrastructure changes that the rest of Phase 2 (Tasks 2.5+) will reuse:
- .github/backend-matrix.yml entries gain a `platform-tag` field
  ('amd64'/'arm64') for matrix entries that participate in the split.
  Other entries omit it; backend_build.yml already defaults missing
  values to '' (empty cache key suffix preserved as cache<suffix>-).
- backend.yml + backend_pr.yml forward `platform-tag` from matrix to
  the reusable backend_build.yml.
- scripts/changed-backends.js groups filtered entries by tag-suffix
  and emits a `merge-matrix` (plus `has-merges`) for groups of size>=2.
  Singletons aren't merged.
- backend.yml + backend_pr.yml gain a `backend-merge-jobs` job that
  consumes merge-matrix and calls backend_merge.yml after backend-jobs.
  PR variant is also event-gated so the no-op-on-PR merge job doesn't
  even start.

The other 34 multi-arch entries are unchanged in this PR -- Task 2.5
fans out the same shape to them once the pilot is observed green.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-09 00:04:42 +02:00
LocalAI [bot]
624fa946f8 feat(swagger): update swagger (#9723)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-08 23:44:55 +02:00
LocalAI [bot]
1f313cfdb0 ci: phase 1-3 of GHA free tier migration (path filter, multi-arch split prep, /mnt disk relief) (#9726)
* ci: extract free-disk-space composite action

Consolidate the apt-clean + dotnet/android/ghc/boost removal blocks from
backend_build.yml, image_build.yml, and test.yml into a single composite
action. The three callers had slightly different inline blocks; the
composite uses the more aggressive backend_build/image_build variant for
all three callers — test.yml jobs now also purge snapd, edge/firefox/
powershell/r-base-core, and sweep /opt/ghc + /usr/local/share/boost +
$AGENT_TOOLSDIRECTORY. Idempotent and skipped on self-hosted runners.

In test.yml, actions/checkout now runs before the composite action call
because the composite lives at ./.github/actions/free-disk-space and
requires a checked-out repo. The original ordering relied on
jlumbroso/free-disk-space@main being a remote action; this is the
minimum-invasive change to support a local composite.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: path-filter backend.yml master push

Run scripts/changed-backends.js on master pushes too (not just PRs) so
unrelated commits don't rebuild all ~210 backend container images. Tag
pushes still build the full matrix via FORCE_ALL.

Push events use the GitHub Compare API to diff event.before..event.after.
Edge cases (first push with zero base, API truncation beyond 300 files,
missing fields, network failure) fall back to "run everything" — better
safe than silently miss a backend.

The matrix literal moves from .github/workflows/backend.yml into a new
data-only file at .github/backend-matrix.yml (outside workflows/ so
actionlint doesn't try to parse it as a workflow). Both backend.yml and
backend_pr.yml now consume the dynamic matrix output uniformly via
fromJson(needs.generate-matrix.outputs.matrix); the script reads the
matrix from the new location.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: bound max-parallel on backend-jobs matrices

Cap to 8 concurrent jobs to avoid queue starvation on the shared GHA free
pool while migration is in flight. Lift after Phases 4-5 retire the
self-hosted runners. Also drops a leftover commented-out max-parallel
line that lived in backend.yml since the previous matrix shape.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: scope backend cache per arch, push by digest

Prepare backend_build.yml for the multi-arch split. The reusable
workflow now accepts a `platform-tag` input ("amd64" / "arm64") that
scopes the registry cache to cache<suffix>-<platform-tag> and (on push
events) pushes the resulting image by canonical digest only. Digests
are uploaded as artifacts named digests<suffix>-<platform-tag> for the
merge job (Task 2.2) to consume.

`platform-tag` is optional with empty default during the migration —
existing callers continue to work unchanged (their cache key just
becomes `cache<suffix>-`, an orphaned but valid key). Tasks 2.3+ will
update callers to pass an explicit "amd64" / "arm64" value. Phase 6
flips the input to required: true once every caller is wired.

PR builds keep their existing tag-based push to ci-tests but pick up
the per-arch cache key. Multi-arch PR builds remain emulated in this
commit; they migrate when the matrix entries split (Tasks 2.3+).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: add backend_merge.yml reusable workflow

Joins per-arch digest artifacts (uploaded by backend_build.yml when
called with platform-tag) into a single tagged multi-arch manifest list
via `docker buildx imagetools create`. Called once per backend by
backend.yml after both per-arch build jobs succeed.

The workflow generates final tags identically to the previous monolithic
build job (same docker/metadata-action invocation), so consumers of
quay.io/go-skynet/local-ai-backends and localai/localai-backends see no
tag-shape change. Two imagetools calls (one per registry) reference the
same per-arch digests under different image names.

Not yet wired into backend.yml — Tasks 2.3+ rewrite individual matrix
entries to expand into per-arch + merge jobs that call this workflow.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: relocate Docker data-root to /mnt on hosted runners

GHA hosted ubuntu-latest runners ship a ~75 GB /mnt drive that's unused
by default. Stopping Docker, rsync'ing /var/lib/docker to /mnt, and
restarting with data-root pointing there yields ~100 GB of working
space (combined with the apt-clean from Task 1.1) — enough for ROCm
dev image + vLLM torch install + flash-attn intermediate layers.

This is the structural change that lets Phases 4 and 5 of the migration
plan move the bigger-runner and arc-runner-set jobs onto ubuntu-latest.

The composite action is no-op on self-hosted runners (where /mnt isn't
expected) and on non-X64 runners (Task 3.2 verifies the arm64 hosted
pool's /mnt shape separately before enabling). Wired into both
backend_build.yml and image_build.yml between free-disk-space and the
first Docker operation.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(setup-build-disk): chmod 1777 /mnt/docker-tmp

buildx CLI runs as the unprivileged 'runner' user and creates config
dirs under TMPDIR before binding them into the buildkit container.
/mnt is root-owned by default, so the original mkdir produced a
permission-denied when buildx tried to write there:

  ERROR: mkdir /mnt/docker-tmp/buildkitd-config2740457204: permission denied

Mirror /tmp's permission mode (1777 — world-writable with sticky bit)
on /mnt/docker-tmp so non-root processes can stage their config.

Caught by the first PR run (image-build hipblas job) on PR #9726.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci: weekly full-matrix rebuild via cron

Path-filtering backend.yml master push (the previous commit's main
optimization) skips backends whose source didn't change. That broke
the DEPS_REFRESH cache-buster's coverage: the build-arg keyed on
%Y-W%V busts the install layer's cache on a new ISO week, but only
when the build actually runs. Untouched Python backends (torch,
transformers, vllm with no version pin) would otherwise ship stale
wheels indefinitely.

Add a Sunday 06:00 UTC cron that fires the full matrix. Schedule
events have no event.ref / event.before, so the script's changedFiles
== null fallback (scripts/changed-backends.js) emits the full matrix
automatically — no script change needed.

C++/Go backends with pinned deps cache-hit and complete fast, so the
weekly cost is dominated by Python re-resolves which is exactly what
we want.

workflow_dispatch added so a maintainer can trigger an ad-hoc
full-matrix rebuild without faking a tag push.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-08 23:43:41 +02:00
LocalAI [bot]
0c1f1e6cbd chore(model gallery): 🤖 add 1 new models via gallery agent (#9720)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-08 16:26:03 +02:00
Richard Palethorpe
670259ce43 chore: Security hardening (#9719)
* fix(http): close 0.0.0.0/[::] SSRF bypass in /api/cors-proxy

The CORS proxy carried its own private-network blocklist (RFC 1918 + a
handful of IPv6 ranges) instead of using the same classification as
pkg/utils/urlfetch.go. The hand-rolled list missed 0.0.0.0/8 and ::/128,
both of which Linux routes to localhost — so any user with FeatureMCP
(default-on for new users) could reach LocalAI's own listener and any
other service bound to 0.0.0.0:port via:

  GET /api/cors-proxy?url=http://0.0.0.0:8080/...
  GET /api/cors-proxy?url=http://[::]:8080/...

Replace the custom check with utils.IsPublicIP (Go stdlib IsLoopback /
IsLinkLocalUnicast / IsPrivate / IsUnspecified, plus IPv4-mapped IPv6
unmasking) and add an upfront hostname rejection for localhost, *.local,
and the cloud metadata aliases so split-horizon DNS can't paper over the
IP check.

The IP-pinning DialContext is unchanged: the validated IP from the
single resolution is reused for the connection, so DNS rebinding still
cannot swap a public answer for a private one between validate and dial.

Regression tests cover 0.0.0.0, 0.0.0.0:PORT, [::], ::ffff:127.0.0.1,
::ffff:10.0.0.1, file://, gopher://, ftp://, localhost, 127.0.0.1,
10.0.0.1, 169.254.169.254, metadata.google.internal.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(downloader): verify SHA before promoting temp file to final path

DownloadFileWithContext renamed the .partial file to its final name
*before* checking the streamed SHA, so a hash mismatch returned an
error but left the tampered file at filePath. Subsequent code that
operated on filePath (a backend launcher, a YAML loader, a re-download
that finds the file already present and skips) would consume the
attacker-supplied bytes.

Reorder: verify the streamed hash first, remove the .partial on
mismatch, then rename. The streamed hash is computed during io.Copy
so no second read is needed.

While here, raise the empty-SHA case from a Debug log to a Warn so
"this download had no integrity check" is visible at the default log
level. Backend installs currently pass through with no digest; the
warning makes that footprint observable without changing behaviour.

Regression test asserts os.IsNotExist on the destination after a
deliberate SHA mismatch.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(auth): require email_verified for OIDC admin promotion

extractOIDCUserInfo read the ID token's "email" claim but never
inspected "email_verified". With LOCALAI_ADMIN_EMAIL set, an attacker
who could register on the configured OIDC IdP under that email (some
IdPs accept self-supplied unverified emails) inherited admin role:

  - first login:  AssignRole(tx, email, adminEmail) → RoleAdmin
  - re-login:     MaybePromote(db, user, adminEmail) → flip to RoleAdmin

Add EmailVerified to oauthUserInfo, parse email_verified from the OIDC
claims (default false on absence so an IdP that omits the claim cannot
short-circuit the gate), and substitute "" for the role-decision email
when verified=false via emailForRoleDecision. The user record still
stores the unverified email for display.

GitHub's path defaults EmailVerified=true: GitHub only returns a public
profile email after verification, and fetchGitHubPrimaryEmail explicitly
filters to Verified=true.

Regression tests cover both the helper contract and integration with
AssignRole, including the bootstrap "first user" branch that would
otherwise mask the gate.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(cli): refuse public bind when no auth backend is configured

When neither an auth DB nor a static API key is set, the auth
middleware passes every request through. That is fine for a developer
laptop, a home LAN, or a Tailnet — the network itself is the trust
boundary. It is not fine on a public IP, where every model install,
settings change, and admin endpoint becomes reachable from the
internet.

Refuse to start in that exact configuration. Loopback, RFC 1918,
RFC 4193 ULA, link-local, and RFC 6598 CGNAT (Tailscale's default
range) all count as trusted; wildcard binds (`:port`, `0.0.0.0`,
`[::]`) are accepted only when every host interface is in one of those
ranges. Hostnames are resolved and treated as trusted only when every
answer is.

A new --allow-insecure-public-bind / LOCALAI_ALLOW_INSECURE_PUBLIC_BIND
flag opts out for deployments that gate access externally (a reverse
proxy enforcing auth, a mesh ACL, etc.). The error message lists this
plus the three constructive alternatives (bind a private interface,
enable --auth, set --api-keys).

The interface enumeration goes through a package-level interfaceAddrsFn
var so tests can simulate cloud-VM, home-LAN, Tailscale-only, and
enumeration-failure topologies without poking at the real network
stack.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(http): regression-test the localai_assistant admin gate

ChatEndpoint already rejects metadata.localai_assistant=true from a
non-admin caller, but the gate was open-coded inline with no direct
test coverage. The chat route is FeatureChat-gated (default-on), and
the assistant's in-process MCP server can install/delete models and
edit configs — the wrong handler change would silently turn the LLM
into a confused deputy.

Extract the gate into requireAssistantAccess(c, authEnabled) and pin
its behaviour: auth disabled is a no-op, unauthenticated is 403,
RoleUser is 403, RoleAdmin and the synthetic legacy-key admin are
admitted.

No behaviour change in the production path.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(http): assert every API route is auth-classified

The auth middleware classifies path prefixes (/api/, /v1/, /models/,
etc.) as protected and treats anything else as a static-asset
passthrough. A new endpoint shipped under a brand-new prefix — or a
new path that simply isn't on the prefix allowlist — would be
reachable anonymously.

Walk every route registered by API() with auth enabled and a fresh
in-memory database (no users, no keys), and assert each API-prefixed
route returns 401 / 404 / 405 to an anonymous request. Public surfaces
(/api/auth/*, /api/branding, /api/node/* token-authenticated routes,
/healthz, branding asset server, generated-content server, static
assets) are explicit allowlist entries with comments justifying them.

Build-tagged 'auth' so it runs against the SQLite-backed auth DB
(matches the existing auth suite).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(http): pin agent endpoint per-user isolation contract

agents.go's getUserID / effectiveUserID / canImpersonateUser /
wantsAllUsers helpers are the single trust boundary for cross-user
access on agent, agent-jobs, collections, and skills routes. A
regression there is the difference between "regular user reads their
own data" and "regular user reads anyone's data via ?user_id=victim".

Lock in the contract:
  - effectiveUserID ignores ?user_id= for unauthenticated and RoleUser
  - effectiveUserID honours it for RoleAdmin and ProviderAgentWorker
  - wantsAllUsers requires admin AND the literal "true" string
  - canImpersonateUser is admin OR agent-worker, never plain RoleUser

No production change — this commit only adds tests.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(downloader): drop redundant stat in removePartialFile

The stat-then-remove pattern is a TOCTOU window and a wasted syscall —
os.Remove already returns ErrNotExist for the missing-file case, so trust
that and treat it as a no-op.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(http): redact secrets from trace buffer and distribution-token logs

The /api/traces buffer captured Authorization, Cookie, Set-Cookie, and
API-key headers verbatim from every request when tracing was enabled. The
endpoint is admin-only but the buffer is reachable via any heap-style
introspection and the captured tokens otherwise outlive the request.
Strip those header values at capture time. Body redaction is left to a
follow-up — the prompts are usually the operator's own and JSON-walking
is invasive.

Distribution tokens were also logged in plaintext from
core/explorer/discovery.go; logs forward to syslog/journald and outlive
the token. Redact those to a short prefix/suffix instead.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(auth): rate-limit OAuth callbacks separately from password endpoints

The shared 5/min/IP limit on auth endpoints is right for password-style
flows but too tight for OAuth callbacks: corporate SSO funnels many real
users through one outbound IP and would trip the limit. Add a separate
60/min/IP limiter for /api/auth/{github,oidc}/callback so callbacks are
bounded against floods without breaking shared-IP deployments.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(gallery): verify backend tarball sha256 when set in gallery entry

GalleryBackend gained an optional sha256 field; the install path now
threads it through to the existing downloader hash-verify (which already
streams, verifies, and rolls back on mismatch). Galleries without sha256
keep working; the empty-SHA path still emits the existing
"downloading without integrity check" warning.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(http): pin CSRF coverage on multipart endpoints

The CSRF middleware in app.go is global (e.Use) so it covers every
multipart upload route — branding assets, fine-tune datasets, audio
transforms, agent collections. Pin that contract: cross-site multipart
POSTs are rejected; same-origin / same-site / API-key clients are not.
Also pins the SameSite=Lax fallback path the skipper relies on when
Sec-Fetch-Site is absent.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(http): XSS hardening — CSP headers, safe href, base-href escape, SVG sandbox

Several closely related XSS-prevention changes spanning the SPA shell, the
React UI, and the branding asset server:

- New SecurityHeaders middleware sets CSP, X-Content-Type-Options,
  X-Frame-Options, and Referrer-Policy on every response. The CSP keeps
  script-src permissive because the Vite bundle relies on inline + eval'd
  scripts; tightening that requires moving to a nonce-based policy.

- The <base href> injection in the SPA shell escaped attacker-controllable
  Host / X-Forwarded-Host headers — a single quote in the host header
  broke out of the attribute. Pass through SecureBaseHref (html.EscapeString).

- Three React sinks rendering untrusted content via dangerouslySetInnerHTML
  switch to text-node rendering with whiteSpace: pre-wrap: user message
  bodies in Chat.jsx and AgentChat.jsx, and the agent activity log in
  AgentChat.jsx. The hand-rolled escape on the agent user-message variant
  is replaced by the same plain-text path.

- New safeHref util collapses non-allowlisted URI schemes (most
  importantly javascript:) to '#'. Applied to gallery `<a href={url}>`
  links in Models / Backends / Manage and to canvas artifact links —
  these come from gallery JSON or assistant tool calls and must be treated
  as untrusted.

- The branding asset server attaches a sandbox CSP plus same-origin CORP
  to .svg responses. The React UI loads logos via <img>, but the same URL
  is also reachable via direct navigation; this prevents script
  execution if a hostile SVG slipped past upload validation.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(http): bound HTTP server with read-header and idle timeouts

A net/http server with no timeouts is trivially Slowloris-able and leaks
idle keep-alive connections. Set ReadHeaderTimeout (30s) to plug the
slow-headers attack and IdleTimeout (120s) to cap keep-alive sockets.

ReadTimeout and WriteTimeout stay at 0 because request bodies can be
multi-GB model uploads and SSE / chat completions stream for many
minutes; operators who need tighter per-request bounds should terminate
slow clients at a reverse proxy.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(auth): pin PUT /api/auth/profile field-tampering contract

The handler uses an explicit local body struct (only name and avatar_url)
plus a gorm Updates(map) with a column allowlist, so an attacker posting
{"role":"admin","email":"...","password_hash":"..."} can't mass-assign
those fields. Lock that down with a regression test so a future
"let's just c.Bind(&user)" refactor breaks loudly.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(services): strip directory components from multipart upload filenames

UploadDataset and UploadToCollectionForUser took the raw multipart
file.Filename and joined it into a destination path. The fine-tune
upload was incidentally safe because of a UUID prefix that fused any
leading '..' to a literal segment, but the protection is fragile.
UploadToCollectionForUser handed the filename to a vendored backend
without sanitising at all.

Strip to filepath.Base at both boundaries and reject the trivial
unsafe values ("", ".", "..", "/").

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(react-ui): validate persisted MCP server entries on load

localStorage is shared across same-origin pages; an XSS that lands once
can poison persisted MCP server config to attempt header injection or
to feed a non-http URL into the fetch path on subsequent loads.
Validate every entry: types must match, URL must parse with http(s)
scheme, header keys/values must be control-char-free. Drop anything
that doesn't fit.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(http): close X-Forwarded-Prefix open redirect

The reverse-proxy support concatenated X-Forwarded-Prefix into the
redirect target without validation, so a forged header value of
"//evil.com" turned the SPA-shell redirect helper at /, /browse, and
/browse/* into a 301 to //evil.com/app. The path-strip middleware had
the same shape on its prefix-trailing-slash redirect.

Add SafeForwardedPrefix at the middleware boundary: must start with
a single '/', no protocol-relative '//' opener, no scheme, no
backslash, no control characters. Apply at both consumers; misconfig
trips the validator and the header is dropped.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(http): refuse wildcard CORS when LOCALAI_CORS=true with empty allowlist

When LOCALAI_CORS=true but LOCALAI_CORS_ALLOW_ORIGINS was empty, Echo's
CORSWithConfig saw an empty allow-list and fell back to its default
AllowOrigins=["*"]. An operator who flipped the strict-CORS feature
flag without populating the list got the opposite of what they asked
for. Echo never sets Allow-Credentials: true so this isn't directly
exploitable (cookies aren't sent under wildcard CORS), but the
misconfiguration trap is worth closing. Skip the registration and warn.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(auth): zxcvbn password strength check with user-acknowledged override

The previous policy was len < 8, which let through "Password1" and the
rest of the credential-stuffing corpus. LocalAI has no second factor
yet, so the bar needs to sit higher.

Add ValidatePasswordStrength using github.com/timbutler/zxcvbn (an
actively-maintained fork of the trustelem port; v1.0.4, April 2024):
- min 12 chars, max 72 (bcrypt's truncation point)
- reject NUL bytes (some bcrypt callers truncate at the first NUL)
- require zxcvbn score >= 3 ("safely unguessable, ~10^8 guesses to
  break"); the hint list ["localai", "local-ai", "admin"] penalises
  passwords built from the app's own branding

zxcvbn produces false positives sometimes (a strong-looking password
that happens to match a dictionary word) and operators occasionally
need to set a known-weak password (kiosk demos, CI rigs). Add an
acknowledgement path: PasswordPolicy{AllowWeak: true} skips the
entropy check while still enforcing the hard rules. The structured
PasswordErrorResponse marks weak-password rejections as Overridable
so the UI can surface a "use this anyway" checkbox.

Wired through register, self-service password change, and admin
password reset on both the server and the React UI.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(react-ui): drop HTML5 minLength on new-password inputs

minLength={12} on the new-password input let the browser block the
form submit silently before any JS or network call ran. The browser
focused the field, showed a brief native tooltip, and that was that —
no toast, no fetch, no clue. Reproducible by typing fewer than 12
chars on the second password change of a session.

The JS-level length check in handleSubmit already shows a toast and
the server rejects with a structured error, so the HTML5 attribute
was redundant defence anyway. Drop it.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(react-ui): bundle Geist fonts locally instead of fetching from Google

The new CSP correctly refused to apply styles from
fonts.googleapis.com because style-src is locked to 'self' and
'unsafe-inline'. Loosening the CSP would defeat its purpose; the
right fix is to stop reaching out to a third-party CDN for fonts on
every page load.

Add @fontsource-variable/geist and @fontsource-variable/geist-mono as
npm deps and import them once at boot. Drop the <link rel="preconnect">
and external stylesheet from index.html.

Side benefit: no third-party tracking via Referer / IP on every UI
load, no failure mode when offline / behind a captive portal.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(react-ui): refresh i18n strings to reflect 12-char password minimum

The translations still said "at least 8 characters" everywhere — the
client-side toast on a too-short password change told the user the
wrong floor. Update tooShort and newPasswordPlaceholder /
newPasswordDescription across all five locales (en, es, it, de,
zh-CN) to match the real ValidatePasswordStrength rule.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(auth): make password length-floor overridable like the entropy check

The 12-char minimum was a policy choice, not a technical invariant —
only "non-empty", "<= 72 bytes", and "no NUL bytes" are real bcrypt
constraints. Treating length-12 as a hard rule was inconsistent with
the entropy check (already overridable) and friction for use cases
where the account is just a name on a session, not a security
boundary (single-user kiosk, CI rig, lab demo).

Restructure ValidatePasswordStrength:
- Hard rules (always enforced): non-empty, <= MaxPasswordLength, no NUL byte
- Policy rules (skipped when AllowWeak=true): length >= 12, zxcvbn score >= 3

PasswordError now marks password_too_short as Overridable too. The
React forms generalised from `error_code === 'password_too_weak'` to
`overridable === true`, and the JS-side preflight length checks were
removed (server is source of truth, returns the same checkbox flow).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-08 16:25:45 +02:00
LocalAI [bot]
e5d7b84216 fix(distributed): split NATS backend.upgrade off install + dedup loads (#9717)
* feat(messaging): add backend.upgrade NATS subject + payload types

Splits the slow force-reinstall path off backend.install so it can run on
its own subscription goroutine, eliminating head-of-line blocking between
routine model loads and full gallery upgrades.

Wire-level Force flag on BackendInstallRequest is kept for one release as
the rolling-update fallback target; doc note marks it deprecated.

Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed/worker): add per-backend mutex helper to backendSupervisor

Different backend names lock independently; same backend serializes. This
is the synchronization primitive used by the upcoming concurrent install
handler — without it, wrapping the NATS callback in a goroutine would
race the gallery directory when two requests target the same backend.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(distributed/worker): run backend.install handler in a goroutine

NATS subscriptions deliver messages serially on a single per-subscription
goroutine. With a synchronous install handler, a multi-minute gallery
download would head-of-line-block every other install request to the
same worker — manifesting upstream as a 5-minute "nats: timeout" on
unrelated routine model loads.

The body now runs in its own goroutine, with a per-backend mutex
(lockBackend) protecting the gallery directory from concurrent operations
on the same backend. Different backend names install in parallel.

Backward-compat: req.Force=true is still honored here, so an older master
that hasn't been updated to send on backend.upgrade keeps working.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed/worker): subscribe to backend.upgrade as a separate path

Slow force-reinstall now lives on its own NATS subscription, so a
multi-minute gallery pull cannot head-of-line-block the routine
backend.install handler on the same worker. Same per-backend mutex
guards both — concurrent install + upgrade for the same backend
serialize at the gallery directory; different backends are independent.

upgradeBackend stops every live process for the backend, force-installs
from gallery, and re-registers. It does not start a new process — the
next backend.install will spawn one with the freshly-pulled binary.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): add UpgradeBackend on NodeCommandSender; drop Force from InstallBackend

Master now sends to backend.upgrade for force-reinstall, with a
nats.ErrNoResponders fallback to the legacy backend.install Force=true
path so a rolling update with a new master + an old worker still
converges. The Force parameter leaves the public Go API surface
entirely — only the internal fallback sets it on the wire.

InstallBackend timeout drops 5min -> 3min (most replies are sub-second
since the worker short-circuits on already-running or already-installed).
UpgradeBackend timeout is 15min, sized for real-world Jetson-on-WiFi
gallery pulls.

Updates the admin install HTTP endpoint
(core/http/endpoints/localai/nodes.go) to the new signature too.

router_test.go's fakeUnloader does not yet implement the new interface
shape; Task 3.2 will catch it up before the next package-level test run.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(distributed): update fakeUnloader for new NodeCommandSender shape

InstallBackend lost its force bool param (Force is not part of the public
Go API anymore — only the internal upgrade-fallback path sets it on the
wire). UpgradeBackend gained a method. Fake records both call slices and
provides an installHook concurrency seam for upcoming singleflight tests.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(distributed): cover UpgradeBackend's new subject + rolling-update fallback

Task 3.1 changed the master to publish UpgradeBackend on the new
backend.upgrade subject; the existing UpgradeBackend tests scripted the
old install subject and so all 3 began failing as expected. Updates them
to script SubjectNodeBackendUpgrade with BackendUpgradeReply.

Adds two new specs for the rolling-update fallback:
  - ErrNoResponders on backend.upgrade triggers a backend.install
    Force=true retry on the same node.
  - Non-NoResponders errors propagate to the caller unchanged.

scriptedMessagingClient gains scriptNoResponders (real nats sentinel) and
scriptReplyMatching (predicate-matched canned reply, used to assert that
the fallback path actually sets Force=true on the install retry).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(distributed): coalesce concurrent identical backend.install via singleflight

Six simultaneous chat completions for the same not-yet-loaded model were
observed firing six independent NATS install requests, each serializing
through the worker's per-subscription goroutine and amplifying queue
depth. SmartRouter now wraps the NATS round-trip in a singleflight.Group
keyed by (nodeID, backend, modelID, replica): N concurrent identical
loads share one round-trip and one reply.

Distinct (modelID, replica) keys still fire independent calls, so
multi-replica scaling and multi-model fan-out are unaffected.

fakeUnloader gains a sync.Mutex around its recording slices to keep
concurrent test goroutines race-clean.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(e2e/distributed): drop force arg from InstallBackend test calls

Two e2e test call sites still passed the trailing force bool that was
removed from RemoteUnloaderAdapter.InstallBackend in 9bde76d7. Caught
by golangci-lint typecheck on the upgrade-split branch (master CI was
already green because these tests don't run in the standard test path).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(distributed): extract worker business logic to core/services/worker

core/cli/worker.go grew to 1212 lines after the backend.upgrade split.
The CLI package was carrying backendSupervisor, NATS lifecycle handlers,
gallery install/upgrade orchestration, S3 file staging, and registration
helpers — all distributed-worker business logic that doesn't belong in
the cobra surface.

Move it to a new core/services/worker package, mirroring the existing
core/services/{nodes,messaging,galleryop} pattern. core/cli/worker.go
shrinks to ~19 lines: a kong-tagged shim that embeds worker.Config and
delegates Run.

No behavior change. All symbols stay unexported except Config and Run.
The three worker-specific tests (addr/replica/concurrency) move with
the code via git mv so history follows them.

Files split as:
  worker.go        - Run entry point
  config.go        - Config struct (kong tags retained, kong not imported)
  supervisor.go    - backendProcess, backendSupervisor, process lifecycle
  install.go       - installBackend, upgradeBackend, findBackend, lockBackend
  lifecycle.go     - subscribeLifecycleEvents (verbatim, decomposition is
                     a follow-up commit)
  file_staging.go  - subscribeFileStaging, isPathAllowed
  registration.go  - advertiseAddr, registrationBody, heartbeatBody, etc.
  reply.go         - replyJSON
  process_helpers.go - readLastLinesFromFile

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(distributed/worker): decompose subscribeLifecycleEvents into per-event handlers

The 226-line subscribeLifecycleEvents method packed eight NATS subscriptions
inline. Each grew context-shaped doc comments mixed with subscription
plumbing, making it hard to read any one handler without scrolling past the
others. Extract each handler into its own method on *backendSupervisor; the
subscriber becomes a thin 8-line dispatcher.

No behavior change: each method body is byte-equivalent to its corresponding
inline goroutine + handler. Doc comments that were attached to the inline
SubscribeReply calls migrate to the new method godocs.

Adding the next NATS subject is now a 2-line patch to the dispatcher plus
one new method, instead of grafting onto a monolith.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-08 16:24:54 +02:00
LocalAI [bot]
6070aafc69 chore(deps): bump LocalAGI for collection rehydrate-on-init-failure fix (#9721)
Picks up mudler/LocalAGI#? (commit 941ac52, merged into main):
in-process collections backend now registers a placeholder for every
on-disk collection at startup — even when the engine wrapper fails to
construct (typically because the embedding model is briefly
unreachable) — and rehydrates lazily on first access. Previously a
transient outage at LocalAI boot silently dropped every existing
collection from the agent pool's in-memory map; users saw "collection
not found" indefinitely until LocalAI was restarted, even after the
embedding service recovered. With this bump the next request to the
collection rehydrates it transparently.

Does not address the deeper LocalRecall issue where
NewPersistentPostgresCollection probes the embedding model at engine
construction even for read-only paths — that needs a separate fix in
mudler/localrecall.

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-08 16:24:42 +02:00
LocalAI [bot]
2be07f61da feat(whisper): honor client cancellation via ggml abort_callback (#9710)
* refactor(transcription): propagate request ctx through ModelTranscription*

Replaces context.Background() with the HTTP request ctx so client
disconnects start cancelling the gRPC call. No backend-side abort wiring
yet — that comes in a later commit. Pure plumbing.

Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(cli): pass ctx to backend.ModelTranscription

Follow-up to e65d3e1f which threaded ctx through ModelTranscription
but missed the CLI caller. CLI commands have no request-scoped ctx,
so context.Background() is correct here.

Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(audio): propagate request ctx into TTS, sound-gen, audio-transform

Same ctx-plumbing pattern applied to the rest of the audio path. CLI
callers use context.Background() since there is no request scope; HTTP
callers use c.Request().Context().

Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(backend): propagate request ctx into biometric, detection, rerank, diarization paths

Replaces remaining context.Background() sites in core/backend with the
caller's ctx. After this commit, every core/backend/*.go entry point
threads the request ctx end-to-end to the gRPC client.

Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(grpc): plumb ctx through AIModel.AudioTranscription{,Stream}

Adds context.Context as first parameter to the AIModel interface methods
that wrap whisper-style transcription. Server-side gRPC handler now
forwards the per-RPC ctx (server-streaming uses stream.Context()).
Whisper, Voxtral, vibevoice-cpp, and sherpa-onnx accept the parameter;
none uses it yet — the actual cancellation primitive lands in the next
commit so this is pure plumbing.

Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(whisper): add abort_callback hook in the C++ bridge

Installs a std::atomic<int> flag, wires it into
whisper_full_params.abort_callback, and exposes a set_abort(int) C
symbol so Go can flip the flag from a goroutine watching the request
context. transcribe() now distinguishes abort (return 2) from real
whisper_full failure (return 1).

Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(whisper): register set_abort symbol in the purego loader

Adds the Go-side binding for the new C export so the next commit can
call CppSetAbort(1) from a watcher goroutine on ctx.Done().

Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(whisper): honor ctx cancellation and return codes.Canceled

A watcher goroutine watches ctx.Done() during AudioTranscription and
calls CppSetAbort(1) on cancel. whisper_full sees abort_callback return
true at the next compute graph step, returns non-zero, and the bridge
returns 2 -> AudioTranscription maps that to codes.Canceled.

Adds an opt-in test (gated on WHISPER_MODEL_PATH / WHISPER_AUDIO_PATH)
that asserts cancellation latency under 5s and proves the abort flag
resets cleanly so the next transcription succeeds.

Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(whisper): join the cancel watcher goroutine before returning

Follow-up to 85edf9d2. The previous commit used `defer close(done)` and
called the watcher "joined synchronously" — but close() only signals,
it does not block until the goroutine exits. That left a window where
a late CppSetAbort(1) from a cancelled call could land on the next
call, after its C-side g_abort reset but before whisper_full() began
polling the abort callback, corrupting the second transcription.

Switch to a sync.WaitGroup join so wg.Wait() blocks until the watcher
has actually returned from its select.

Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(whisper): short-circuit pre-cancelled ctx in AudioTranscription

If ctx is already Done() at entry, return codes.Canceled immediately
instead of running the full transcription. The C-side g_abort reset
happens at the start of transcribe() and would otherwise overwrite a
watcher-set abort flag from an already-cancelled ctx, producing a
spurious successful transcription on a request the client has already
abandoned.

Assisted-by: Claude:claude-haiku-4-5
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(tests/distributed): update testLLM mock for new AudioTranscription signature

Phase B (93c48e19) added context.Context to AIModel.AudioTranscription
but missed the testLLM mock in tests/e2e/distributed. CI golangci-lint
caught it: *testLLM did not implement grpc.AIModel because the method
signature lacked the ctx parameter, which broke the distributed test
suite compilation and cascaded through every backend-build job that
runs `go build ./...`.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(whisper): port cancellation test to Ginkgo/Gomega

Project policy (.agents/coding-style.md, enforced by golangci-lint
forbidigo) is that all Go tests must use Ginkgo v2 + Gomega — no
stdlib testing patterns (t.Skip, t.Fatalf, etc.). Convert the
cancellation test to a Describe/It block with Skip(...) for env
gating and Expect/HaveOccurred for assertions.

Same coverage: cancel mid-flight returns codes.Canceled within 5s and
a follow-up transcription succeeds, proving the C-side g_abort flag
resets cleanly.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-08 01:44:47 +02:00
LocalAI [bot]
806130bbc0 chore: ⬆️ Update ggml-org/whisper.cpp to c81b2dabbc45484dee2ca6658cfe39c841df5c70 (#9712)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-08 01:44:32 +02:00
LocalAI [bot]
3b84582567 chore: ⬆️ Update ggml-org/llama.cpp to 05ff59cb57860cc992fc6dcede32c696efea711c (#9714)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-08 01:44:17 +02:00
LocalAI [bot]
907929ce60 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 9a26522af234f8db079ae3735f35ab6c20fe2c66 (#9713)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-08 01:43:44 +02:00
Ettore Di Giacinto
e6916ae9b1 fix(gallery/flux.2): enable VAE encoder so image edits actually work
stable-diffusion.cpp loads the VAE encoder weights only when the ctx is
created with vae_decode_only=false. Our gosd wrapper defaults that flag
to true, and the upstream flux2/flux2-klein code paths don't auto-flip
it (sd_version_is_unet_edit / sd_version_is_control both return false
for VERSION_FLUX2 and VERSION_FLUX2_KLEIN). The CLI compensates by
flipping the flag whenever -r/--ref-image is passed, but on the server
side we don't know that at load time.

Result: requests to /v1/images/generations with `ref_images` against
flux.2-dev / flux.2-klein-{4b,9b} would silently skip the encode step
(first_stage_model.encoder + tae.encoder are dropped at load via
ignore_tensors), and the output had no relation to the reference
image — image-editing was effectively unsupported even though
stable-diffusion.cpp itself supports it.

Add `vae_decode_only:false` to the options of all three flux.2 gallery
entries so the encoder is loaded on first launch. Verified on a live
instance: a 64x64 striped reference image produces an output that
preserves the 3-band layout, confirming the VAE encoder is now wired
through.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
2026-05-07 23:18:08 +00:00
Ettore Di Giacinto
3bc5ae8da6 fix(tests/e2e-backends): bump ctx_size for llama-cpp transcription
Qwen3-ASR-0.6B encodes the jfk.wav fixture into 777 audio tokens via
its mmproj, but the test harness defaulted BACKEND_TEST_CTX_SIZE to
512, so llama.cpp server rejected every transcription request with
"request (777 tokens) exceeds the available context size (512 tokens)".

Set BACKEND_TEST_CTX_SIZE=2048 on the llama-cpp transcription target
only — sherpa-onnx and vibevoice transcription targets don't go
through llama.cpp's slot/n_ctx and weren't failing.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
2026-05-07 22:31:08 +00:00
dependabot[bot]
3234e6d6ba chore(deps): bump the go_modules group across 1 directory with 8 updates (#9705)
Bumps the go_modules group with 8 updates in the / directory:

| Package | From | To |
| --- | --- | --- |
| [github.com/buger/jsonparser](https://github.com/buger/jsonparser) | `1.1.1` | `1.1.2` |
| [github.com/antchfx/xpath](https://github.com/antchfx/xpath) | `1.3.4` | `1.3.6` |
| [github.com/cloudflare/circl](https://github.com/cloudflare/circl) | `1.6.1` | `1.6.3` |
| [github.com/go-git/go-git/v5](https://github.com/go-git/go-git) | `5.16.4` | `5.18.0` |
| [github.com/gofiber/fiber/v2](https://github.com/gofiber/fiber) | `2.52.11` | `2.52.13` |
| [github.com/jackc/pgx/v5](https://github.com/jackc/pgx) | `5.8.0` | `5.9.2` |
| [golang.org/x/image](https://github.com/golang/image) | `0.25.0` | `0.38.0` |
| [github.com/ipld/go-ipld-prime](https://github.com/ipld/go-ipld-prime) | `0.21.0` | `0.23.0` |



Updates `github.com/buger/jsonparser` from 1.1.1 to 1.1.2
- [Release notes](https://github.com/buger/jsonparser/releases)
- [Commits](https://github.com/buger/jsonparser/compare/v1.1.1...v1.1.2)

Updates `github.com/antchfx/xpath` from 1.3.4 to 1.3.6
- [Release notes](https://github.com/antchfx/xpath/releases)
- [Commits](https://github.com/antchfx/xpath/compare/v1.3.4...v1.3.6)

Updates `github.com/cloudflare/circl` from 1.6.1 to 1.6.3
- [Release notes](https://github.com/cloudflare/circl/releases)
- [Commits](https://github.com/cloudflare/circl/compare/v1.6.1...v1.6.3)

Updates `github.com/go-git/go-git/v5` from 5.16.4 to 5.18.0
- [Release notes](https://github.com/go-git/go-git/releases)
- [Changelog](https://github.com/go-git/go-git/blob/main/HISTORY.md)
- [Commits](https://github.com/go-git/go-git/compare/v5.16.4...v5.18.0)

Updates `github.com/gofiber/fiber/v2` from 2.52.11 to 2.52.13
- [Release notes](https://github.com/gofiber/fiber/releases)
- [Commits](https://github.com/gofiber/fiber/compare/v2.52.11...v2.52.13)

Updates `github.com/jackc/pgx/v5` from 5.8.0 to 5.9.2
- [Changelog](https://github.com/jackc/pgx/blob/master/CHANGELOG.md)
- [Commits](https://github.com/jackc/pgx/compare/v5.8.0...v5.9.2)

Updates `golang.org/x/image` from 0.25.0 to 0.38.0
- [Commits](https://github.com/golang/image/compare/v0.25.0...v0.38.0)

Updates `github.com/ipld/go-ipld-prime` from 0.21.0 to 0.23.0
- [Release notes](https://github.com/ipld/go-ipld-prime/releases)
- [Changelog](https://github.com/ipld/go-ipld-prime/blob/master/CHANGELOG.md)
- [Commits](https://github.com/ipld/go-ipld-prime/compare/v0.21.0...v0.23.0)

---
updated-dependencies:
- dependency-name: github.com/antchfx/xpath
  dependency-version: 1.3.6
  dependency-type: indirect
- dependency-name: github.com/buger/jsonparser
  dependency-version: 1.1.2
  dependency-type: indirect
- dependency-name: github.com/cloudflare/circl
  dependency-version: 1.6.3
  dependency-type: indirect
- dependency-name: github.com/go-git/go-git/v5
  dependency-version: 5.18.0
  dependency-type: indirect
- dependency-name: github.com/gofiber/fiber/v2
  dependency-version: 2.52.13
  dependency-type: indirect
- dependency-name: github.com/ipld/go-ipld-prime
  dependency-version: 0.23.0
  dependency-type: indirect
- dependency-name: github.com/jackc/pgx/v5
  dependency-version: 5.9.2
  dependency-type: indirect
- dependency-name: golang.org/x/image
  dependency-version: 0.38.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-07 17:29:04 +02:00
LocalAI [bot]
595b6fd22d feat(api/transcription): include segments + duration + language on stream done event (#9709)
streamTranscription previously emitted a done event with just `text`,
matching the OpenAI streaming spec exactly. Streaming clients that need
per-utterance timings or audio duration had to fall back to the
non-streaming JSON path — and that path is exactly the one that trips
on ResponseHeaderTimeout when whisper requests queue behind each other
on a SingleThread backend.

Extend the done event to additively carry `language`, `duration`, and
a `segments` array (id, start, end, text — start/end as float seconds,
matching TranscriptionSegmentSeconds). Empty / zero values are still
omitted; spec-compliant clients ignore the new fields.

This unblocks notary's streaming Transcribe (companion change in the
notary repo) so it produces the same TranscriptionResult shape as the
JSON path while sidestepping the queue-induced header timeouts.


Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-07 17:28:26 +02:00
LocalAI [bot]
447c186089 fix(distributed): make backend upgrade actually re-install on workers (#9708)
* fix(distributed): make backend upgrade actually re-install on workers

UpgradeBackend dispatched a vanilla backend.install NATS event to every
node hosting the backend. The worker's installBackend short-circuits on
"already running for this (model, replica) slot" and returns the
existing address — so the gallery install path was skipped, no artifact
was re-downloaded, no metadata was written. The frontend's drift
detection then re-flagged the same backends every cycle (installedDigest
stays empty → mismatch → "Backend upgrade available (new build)") while
"Backend upgraded successfully" landed in the logs at the same time.
The user-visible symptom: clicking "Upgrade All" silently does nothing
and the same N backends sit on the upgrade list forever.

Two coupled fixes, one PR:

1. Force flag on backend.install. Add `Force bool` to
   BackendInstallRequest and thread it through NodeCommandSender ->
   RemoteUnloaderAdapter. UpgradeBackend (and the reconciler's pending-op
   drain when retrying an upgrade) sets force=true; routine load events
   and admin install endpoints keep force=false. On the worker, force=true
   stops every live process that uses this backend (resolveProcessKeys
   for peer replicas, plus the exact request processKey), skips the
   findBackend short-circuit, and passes force=true into
   gallery.InstallBackendFromGallery so the on-disk artifact is
   overwritten. After the gallery install completes, startBackend brings
   up a fresh process at the same processKey on a new port.

2. Liveness check on the fast path. installBackend's "already running"
   branch read getAddr without verifying the process was alive, so a
   gRPC backend that died without the supervisor noticing left a stale
   (key, addr) entry. The reconciler then dialed that address, got
   ECONNREFUSED, marked the replica failed, retried install — and the
   supervisor said "already running addr=…" again. Loop forever, exactly
   what we observed on a node whose llama-cpp process had died but whose
   supervisor record persisted. Verify s.isRunning(processKey) before
   trusting getAddr; if the entry is stale, stopBackendExact cleans up
   and we fall through to a real install.

Backwards-compatible: the new Force field is omitempty, older workers
ignore it (their default behavior matches force=false). The signature
change on NodeCommandSender.InstallBackend is internal-only.

Verified: unit tests in core/services/nodes pass (108s suite). The
pre-existing core/backend build break (proto regen pending for
word-level timestamps) blocks core/cli and core/http/endpoints/localai
package tests but is unrelated to this change.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* test(e2e/distributed): pass force=false to adapter.InstallBackend

NodeCommandSender.InstallBackend gained a final force bool in the
upgrade-force commit; the e2e distributed lifecycle tests still called
the old 8-arg signature and broke compilation. These tests exercise the
routine install path (single replica, default behavior), so force=false
preserves their existing semantics.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-07 17:28:14 +02:00
LocalAI [bot]
cec5c4fdfc fix(http): make handler-error status visible in access log + transcription errors (#9707)
* fix(http): log accurate status code when handler returns error

The custom xlog access-log middleware in API() reads res.Status
*before* Echo's central HTTPErrorHandler runs, so when a handler
returns an error without writing a response (e.g.
TranscriptEndpoint's `return err` on backend failure) the status
field stays at its default 200. The logged line then claims
status=200 while the client receives 500 — silently hiding every
500/503/etc. that bubbles up through Echo's error handler.

Mirror echo.DefaultHTTPErrorHandler's status derivation when
err != nil and the response hasn't been committed: default to 500,
upgrade to *echo.HTTPError.Code if applicable. The logged status now
matches what the client actually sees, so failed transcription
requests stop appearing as 200 in the access log.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* fix(transcription): log underlying error before returning 500 to client

ModelTranscriptionWithOptions surfaces real failures — gRPC errors
from a remote node, model load problems, ffmpeg conversion crashes —
but TranscriptEndpoint just did `return err`, so Echo turned it into
a 500 with a generic body and the original error was lost. Operators
chasing transcription failures across distributed mode were left
with "upstream returned 500" on the client and zero context anywhere
in the frontend's logs.

Add an xlog.Error before returning, recording model name, the staged
audio path, and the underlying error. Combined with the access-log
status fix, a failing transcription now leaves an audit trail (real
status code in the access line, real cause in an Error line) instead
of vanishing.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-07 17:27:45 +02:00
Richard Palethorpe
c894d9c826 feat(sglang): wire engine_args, add cuda13 build, ship MTP gallery demos (#9686)
Bring the sglang Python backend up to feature parity with vllm by adding
the same engine_args:-map plumbing the vLLM backend already has. Any
ServerArgs field (~380 in sglang 0.5.11) becomes settable from a model
YAML, including the speculative-decoding flags needed for Multi-Token
Prediction. Validation matches the vllm backend's: keys are checked
against dataclasses.fields(ServerArgs), unknown keys raise ValueError
with a difflib close-match suggestion at LoadModel time, and the typed
ModelOptions fields keep their existing meaning with engine_args
overriding them.

Backend code:
* backend/python/sglang/backend.py: add _apply_engine_args, import
  dataclasses/difflib/ServerArgs, call from LoadModel; rename Seed ->
  sampling_seed (sglang 0.5.11 renamed the SamplingParams field).
* backend/python/sglang/test.py + test.sh + Makefile: six unit tests
  exercising the helper directly (no engine load required).

Build / CI / backend gallery (cuda13 + l4t13 paths are now first-class):
* backend/python/sglang/install.sh: add --prerelease=allow because
  sglang 0.5.11 hard-pins flash-attn-4 which only ships beta wheels;
  add --index-strategy=unsafe-best-match for cublas12 so the cu128
  torch index wins over default-PyPI's cu130; new pyproject.toml-driven
  l4t13 install path so [tool.uv.sources] can pin torch/torchvision/
  torchaudio/sglang to the jetson-ai-lab index without forcing every
  transitive PyPI dep through the L4T mirror's flaky proxy (mirrors the
  equivalent fix in backend/python/vllm/install.sh).
* backend/python/sglang/pyproject.toml (new): L4T project spec with
  explicit-source jetson-ai-lab index. Replaces requirements-l4t13.txt
  for the l4t13 BUILD_PROFILE; other profiles still go through the
  requirements-*.txt pipeline via libbackend.sh's installRequirements.
* backend/python/sglang/requirements-l4t13.txt: removed; superseded
  by pyproject.toml.
* backend/python/sglang/requirements-cublas{12,13}{,-after}.txt: pin
  sglang>=0.5.11 (Gemma 4 floor); add cu130 torch index for cublas13
  (new files) and cu128 torch index for cublas12 (default PyPI now
  ships cu130 torch wheels by default and breaks cu12 hosts).
* backend/index.yaml: add cuda13-sglang and cuda13-sglang-development
  capability mappings + image entries pointing at
  quay.io/.../-gpu-nvidia-cuda-13-sglang.
* .github/workflows/backend.yml: new cublas13 sglang matrix entry,
  mirroring vllm's cuda13 build.

Model gallery + docs:
* gallery/sglang.yaml: base sglang config template, mirrors vllm.yaml.
* gallery/sglang-gemma-4-{e2b,e4b}-mtp.yaml: Gemma 4 MTP demos
  transcribed verbatim from the SGLang Gemma 4 cookbook MTP commands.
* gallery/sglang-mimo-7b-mtp.yaml: MiMo-7B-RL with built-in MTP heads
  + online fp8 weight quantization, verified end-to-end on a 16 GB
  RTX 5070 Ti at ~88 tok/s. Uses mem_fraction_static: 0.7 because the
  MTP draft worker's vocab embedding is loaded unquantised and OOMs
  the static reservation at sglang's 0.85 default.
* gallery/index.yaml: three new entries (gemma-4-e2b-it:sglang-mtp,
  gemma-4-e4b-it:sglang-mtp, mimo-7b-mtp:sglang).
* docs/content/features/text-generation.md: new SGLang section with
  setup, engine_args reference, MTP demos, version requirements.
* .agents/sglang-backend.md (new): agent one-pager covering the flat
  ServerArgs structure, the typed-vs-engine_args precedence, the
  speculative-decoding cheatsheet, and the mem_fraction_static gotcha
  documented above.
* AGENTS.md: index entry for the new agent doc.

Known limitation: the two Gemma 4 MTP gallery entries ship a recipe
that doesn't yet run on stock libraries. The drafter checkpoints
(google/gemma-4-{E2B,E4B}-it-assistant) declare
model_type: gemma4_assistant / Gemma4AssistantForCausalLM, which
neither transformers (<=5.6.0, including the SGLang cookbook's pinned
commit 91b1ab1f... and main HEAD) nor sglang's own model registry
(<=0.5.11) registers as of 2026-05-06. They will start working when
HF or sglang upstream registers the architecture -- no LocalAI
changes needed. The MiMo MTP demo and the non-MTP Gemma 4 paths work
today on this build (verified on RTX 5070 Ti, 16 GB).

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] [WebFetch] [WebSearch]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-07 17:27:29 +02:00
Ettore Di Giacinto
048daa0cdc fix(chatterbox): install chatterbox-tts with --no-deps and pin runtime deps
The previous omegaconf pin only addressed one symptom of a deeper problem:
chatterbox-tts upstream depends on `russian-text-stresser` (unpinned git URL),
which transitively pins `spacy==3.6.*` and other ancient packages. That cascade
forces pip to backtrack through Jinja2/MarkupSafe/omegaconf into Python-2-era
sdists that no longer build (e.g. ruamel.yaml<0.15, Jinja2 2.6 importing the
long-removed `setuptools.Feature`).

Install chatterbox-tts itself with --no-deps in install.sh and list its real
runtime deps explicitly in each requirements-*.txt, dropping the optional
russian-text-stresser. This unblocks the darwin (and other) builds without
playing whack-a-mole on each newly-discovered transitive pin.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
2026-05-07 09:03:40 +00:00
Ettore Di Giacinto
88f5029a5b chore(deps): bump LocalAGI/LocalRecall — go-pdfium WASM PDF extractor
LocalRecall switched its PDF extractor from gen2brain/go-fitz (cgo +
static libmupdf) to klippa-app/go-pdfium with the WebAssembly backend
(no cgo, no static libs, no glibc symbol issues). This bump pulls in
the new shape via LocalAGI@main → LocalRecall@a7724fe.

Why this matters: the previous go-fitz approach broke aarch64 LocalAI
builds with `__isoc23_strtol undefined reference` link errors —
go-fitz's bundled libmupdf static library is compiled against
glibc ≥ 2.38 (C2x strtol intrinsics) but the LocalAI builder uses
older glibc. The WASM backend has no native dependencies; same Go
binary works on every architecture.

Indirect graph: drops gen2brain/go-fitz, ebitengine/purego,
jupiterrider/ffi; adds klippa-app/go-pdfium + tetratelabs/wazero.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 07:48:14 +00:00
Ettore Di Giacinto
7c77d3506a fix(chatterbox): pin omegaconf in every profile requirements file
The previous pin in requirements.txt was ineffective: installRequirements
runs a separate `pip install --requirement` per file, so resolution does
not carry over to the per-profile file where chatterbox-tts is declared.
With chatterbox-tts's unpinned `omegaconf` dep, pip backtracked through
1.x sdists into ruamel.yaml<0.15, whose Python-2-era setup.py fails on
Python 3.10+.

Pin omegaconf==2.3.0 next to chatterbox-tts in every profile file
(matches what upstream chatterbox uses). Drop the dead pin from
requirements.txt.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
2026-05-07 07:44:37 +00:00
dependabot[bot]
c96ce99742 chore(deps): bump openssl from 0.10.76 to 0.10.79 in /backend/rust/kokoros in the cargo group across 1 directory (#9694)
chore(deps): bump openssl

Bumps the cargo group with 1 update in the /backend/rust/kokoros directory: [openssl](https://github.com/rust-openssl/rust-openssl).


Updates `openssl` from 0.10.76 to 0.10.79
- [Release notes](https://github.com/rust-openssl/rust-openssl/releases)
- [Commits](https://github.com/rust-openssl/rust-openssl/compare/openssl-v0.10.76...openssl-v0.10.79)

---
updated-dependencies:
- dependency-name: openssl
  dependency-version: 0.10.79
  dependency-type: indirect
  dependency-group: cargo
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-07 08:30:18 +02:00
LocalAI [bot]
840db2fde3 chore(model gallery): 🤖 add 1 new models via gallery agent (#9703)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-07 08:29:56 +02:00
LocalAI [bot]
0b9344ef3d chore: ⬆️ Update leejet/stable-diffusion.cpp to 90e87bc846f17059771efb8aaa31e9ef0cab6f78 (#9701)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-07 08:29:41 +02:00
LocalAI [bot]
151d6c9cf0 chore: ⬆️ Update ggml-org/llama.cpp to 2496f9c14965c39589f53eea31bdb6d762b1d360 (#9698)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-07 08:29:27 +02:00
LocalAI [bot]
659939db9b chore: ⬆️ Update ikawrakow/ik_llama.cpp to b93721902b4662f9b973b1c412006081c958d085 (#9697)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-07 08:29:12 +02:00
LocalAI [bot]
392fc9ce3d fix(auth): cascade user deletion across all owned data on PostgreSQL (#9702)
* fix(auth): cascade user deletion across all owned data on PostgreSQL

Deleting a user from the admin UI in distributed mode (PostgreSQL auth
DB) returned "user not found" even when the user clearly existed. The
old handler ignored result.Error and only checked RowsAffected, so a
foreign-key constraint violation surfaced as a misleading 404.

Two issues drove this:

1. invite_codes.created_by / used_by reference users(id) but the
   InviteCode model declared the FKs without ON DELETE CASCADE. On
   PostgreSQL the engine therefore rejected the user delete with NO
   ACTION whenever the user had ever issued or consumed an invite. On
   SQLite (default in single-node mode) FKs are not enforced, so the
   bug never appeared there.
2. Several owned tables were never cleaned up regardless of dialect:
   user_permissions and quota_rules relied on CASCADE that does not
   fire under SQLite, and usage_records have no FK at all and were
   left orphaned in every dialect.

Introduce auth.DeleteUserCascade which runs the full cleanup in a
single transaction: drop invites authored by the user, NULL used_by on
invites they consumed (preserves the audit trail), and explicitly wipe
sessions, API keys, permissions, quota rules, and usage metrics before
deleting the user. The in-memory quota cache is invalidated after
commit so a recreated user with the same id never sees stale entries.
The HTTP handler now maps the helper's errors to proper status codes —
real failures surface as 500 with the cause instead of being swallowed
as "not found".

Add Ginkgo regression coverage in core/http/auth/users_test.go and
core/http/routes/auth_test.go covering invite cleanup, used_by
null-out, full data wipe, and the FK-enforced original failure mode
(via PRAGMA foreign_keys=ON to mirror PostgreSQL behavior on SQLite).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* chore(deps): bump LocalAGI/LocalRecall — pull in go-fitz PDF extraction

Pulls LocalAGI@main (facd888) and LocalRecall@v0.6.0. The latter
swaps PDF text extraction from dslipak/pdf to gen2brain/go-fitz
(libmupdf bindings) and wraps it in a 60s goroutine timeout —
previously certain PDFs (broken xref tables, encrypted, image-only
without OCR) would hang indefinitely inside r.GetPlainText() and
poison the upload queue.

Pure dep bump, no LocalAI source changes. Indirect graph picks up
go-fitz + purego + ffi; drops dslipak/pdf.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 08:28:58 +02:00
LocalAI [bot]
31368092a8 chore(model-gallery): ⬆️ update checksum (#9700)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-07 08:28:18 +02:00
LocalAI [bot]
890c070e55 feat(swagger): update swagger (#9699)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-07 08:28:04 +02:00
Tai An
0497bb6595 fix(downloader): list supported URL schemes in DownloadFile error (#9689)
* fix(downloader): list supported URL schemes when input is unrecognized

The error message previously read "does not look like an HTTP URL",
but the downloader actually supports file://, huggingface://, hf://,
ollama://, oci://, and github:// in addition to http(s)://. Users who
type a bare filename or a typo'd scheme (e.g. fle:// instead of file://)
get the misleading impression that only HTTP is accepted.

Reference the existing prefix constants directly via strings.Join so
the scheme list cannot drift when new prefixes are added.

Refs #9683.

Signed-off-by: Tai An <antai12232931@outlook.com>

* fix(downloader): normalize uri.go to LF line endings

Resolves the noisy diff and golangci-lint errcheck warnings on lines I did not actually modify.

* fix(downloader): preserve trailing newline at end of file

---------

Signed-off-by: Tai An <antai12232931@outlook.com>
2026-05-06 21:59:09 +02:00
Ettore Di Giacinto
b2be9729ef fix(chatterbox): pin omegaconf>=2.0 to prevent resolver backtracking
Without an upper-floor pin, pip's resolver backtracks through omegaconf 1.x
sdists when installing chatterbox-tts. Old 1.x setups depend on
ruamel.yaml<0.15, whose setup.py uses Python-2-era names (Str, Bytes) and
fails to build on Python 3.10+, breaking the darwin python backend build.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
2026-05-06 18:07:32 +00:00
LocalAI [bot]
22ff86d64f fix(distributed): round-robin replicas of the same model (#9695)
FindAndLockNodeWithModel previously ordered candidate replicas by
in_flight ASC, available_vram DESC. The primary key is correct, but the
tiebreaker meant that whenever in_flight tied — the common case at low
to moderate concurrency where requests don't overlap — the node with
the largest available VRAM won every pick. With autoscaling placing
replicas of the same model on multiple nodes, the fattest GPU node
ended up taking nearly all the load while the others sat idle.

Insert last_used ASC between the two existing tiers. last_used is
already refreshed inside the same transaction that increments in_flight
(and by TouchNodeModel on cache hits in the router), so the
"oldest-used" replica naturally rotates through the candidate set —
strict round-robin without a schema change. available_vram DESC is
demoted to a final tiebreaker for cold starts where last_used is
identical across replicas.

Placement queries (FindNodeWithVRAM, FindLeastLoadedNode, and the
*FromSet variants) have the same fattest-GPU bias on tiebreakers but
are higher-cost to fix consistently. Deferred to a follow-up so the
routing fix can land first — for the user-observed symptom routing was
the dominant cause anyway.

Test: registry_test.go adds a focused spec that loads three replicas
on three nodes with 24/16/8 GB VRAM and asserts each is picked at
least twice across 9 in_flight-tied calls.


Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash] [Grep]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-06 19:40:54 +02:00
LocalAI [bot]
4e154b59e5 fix(ci): unbreak rerankers (torch bump) and vllm-omni on aarch64 (#9688)
Two unrelated CI breakages bundled together since both are one-liners:

- rerankers: bump torch 2.4.1 -> 2.7.1 on cpu/cublas12. The unpinned
  transformers resolves to 5.x, whose moe.py registers a custom_op with
  string-typed `'torch.Tensor'` annotations that torch 2.4.1's
  infer_schema rejects, blocking the gRPC server from starting and
  failing all 5 backend tests with "Connection refused" on :50051.
  Matches the version used by the transformers backend.

- vllm-omni: strip fa3-fwd from the upstream requirements/cuda.txt
  before resolving on aarch64. fa3-fwd 0.0.3 ships only an
  x86_64 wheel and has no sdist, making the cuda profile unsatisfiable
  on Jetson/SBSA. fa3-fwd is a soft runtime dep — vllm-omni's
  attention backends fall back to FA2 then SDPA when it's missing.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-06 17:07:24 +02:00
Richard Palethorpe
969005b2a1 feat(gallery): Speed up load times and clean gallery entries (#9211)
* feat: Rework VRAM estimation and use known_usecases in gallery

Signed-off-by: Richard Palethorpe <io@richiejp.com>
Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]

* chore(gallery): regenerate gallery index and add known_usecases to model entries

Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-06 14:51:38 +02:00
LocalAI [bot]
6d56bf98fe feat(importers): add vibevoice-cpp importer for GGUF bundles (#9685)
Routes mudler/vibevoice.cpp-models and similar repos to the vibevoice-cpp
backend. Detects via repo name ("vibevoice.cpp"/"vibevoice-cpp"), file
listing (vibevoice-*.gguf + tokenizer.gguf), or preferences.backend
override. Defaults to the realtime TTS model; preferences.usecase=asr
selects the ASR/diarization variant. Bundles the required tokenizer.gguf
and (for TTS) a voice prompt, emitting the Options[] entries the backend
expects. Registered ahead of VibeVoiceImporter so the C++ bundles aren't
swallowed by the older Python-backend substring match.


Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-06 13:33:10 +02:00
LocalAI [bot]
a8d7d37a3c fix: unbreak master CI (docs, kokoros, vibevoice-cpp ABI) (#9682)
* fix(docs): correct broken Hugo relrefs

The Hugo build has been failing on master since the relevant pages
landed:

- text-generation.md:720 referenced `/docs/features/distributed-mode`,
  but Hugo `relref` paths are relative to the content root, not the
  rendered URL. Drop the `/docs/` prefix so the lookup matches the
  existing `features/...` form used elsewhere in the file.
- audio-transform.md:144 referenced `tts.md`; the actual page is
  `text-to-audio.md`.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(kokoros): stub Diarize and AudioTransform Backend trait methods

The recent backend.proto additions (Diarize, AudioTransform,
AudioTransformStream) extended the gRPC Backend trait, breaking
kokoros-grpc compilation with E0046 because the Rust implementation
hadn't picked up the new methods. Add Unimplemented stubs matching the
existing pattern for non-applicable RPCs in this TTS-only backend.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(vibevoice-cpp): track upstream ABI + wire 1.5B voice cloning

Two recent commits in mudler/vibevoice.cpp reshaped the vv_capi_tts
signature without a corresponding bump on the LocalAI side:

  3bd759c "1.5b: unify into a single tts entry point" inserted a
          ref_audio_path parameter between voice_path and dst_wav_path.
  ad856bd "1.5b: multi-speaker dialog support" promoted that to a
          (const char* const* ref_audio_paths, int n_ref_audio_paths)
          pair for per-speaker conditioning.

Because purego resolves symbols by name and not by signature, the
build kept linking; at runtime the misaligned arguments turned the
TTS->ASR closed-loop test into a SIGSEGV inside cgo. Track HEAD
explicitly and bring the bridge in line with it:

  * Update the CppTTS purego binding to the 9-arg form. purego
    marshals []*byte as a **char by handing the C side the underlying
    array address; nil/empty maps to NULL, which matches the C
    contract for "no reference audio" on the realtime-0.5B path.
  * Add a `ref_audio` gallery option (comma-separated, repeatable)
    that the 1.5B path consumes for runtime voice cloning. Multiple
    entries are interpreted as one WAV per speaker (Speaker 0..n-1).
  * TTSRequest.Voice now routes by extension/shape: `.wav` or a
    comma-separated list goes to ref_audio_paths; anything else stays
    on voice_path (realtime-0.5B's pre-baked voice gguf).
  * Pin VIBEVOICE_CPP_VERSION to ad856bd and wire the Makefile into
    the existing bump_deps matrix so future upstream rolls land as
    reviewable PRs instead of a silent CI break.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(vibevoice-cpp): use ModelOptions.AudioPath for 1.5B ref audio

Use the existing audio_path field from ModelOptions (already plumbed
through config_file's `audio_path:` YAML and consumed by other audio
backends like kokoros) instead of inventing a custom `ref_audio:`
Options[] string. Multi-speaker setups stay on a single comma-
separated value.

No behavior change beyond the gallery key name; per-call routing via
TTSRequest.Voice is unchanged.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-06 10:36:59 +02:00
LocalAI [bot]
06a1524155 chore(model gallery): 🤖 add 1 new models via gallery agent (#9681)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-06 08:47:40 +02:00
LocalAI [bot]
70cf8ac546 fix(backend): resolve relative draft_model paths against the models dir (#9680)
* fix(backend): resolve relative draft_model paths against the models dir

The main model file and mmproj are joined with the configured models
directory before reaching the backend, but draft_model was sent
verbatim. With a relative draft_model in the YAML config, llama.cpp
opens the path from the backend process's CWD and fails with "No such
file or directory", forcing users to hard-code an absolute path.

Mirror the existing mmproj resolution: if draft_model is relative,
join it with modelPath. Absolute paths are passed through unchanged.

Adds an e2e regression test against the mock backend that asserts the
main model file, mmproj, and draft_model all arrive at the backend
resolved to absolute paths.

Closes #9675

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7-1m [Read] [Edit] [Bash] [Write]

* fix(backend): always join draft_model with models dir (drop IsAbs shortcut)

The previous commit kept absolute draft_model paths intact via an
IsAbs check. That left a path-traversal vector open: a user-supplied
YAML config could set draft_model to /etc/passwd (or any other host
file the backend process can read) and the path would be sent through
unchanged.

filepath.Join cleans the leading slash from absolute components, so
joining unconditionally — the way mmproj already does — keeps the
result rooted at the configured models directory regardless of input.

Adds a second e2e spec that feeds an absolute draft_model into the
mock backend and asserts the path is clamped under modelsPath.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7-1m [Read] [Edit] [Bash]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-06 00:58:38 +02:00
LocalAI [bot]
7fab5e3d21 chore: ⬆️ Update ggml-org/whisper.cpp to 4bf733672b2871d4153158af4f621a6dd9104f4a (#9636)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-06 00:34:16 +02:00
Andreas Egli
af83518532 feat: support word-level timestamps for faster-whisper (#9621)
Signed-off-by: Andreas Egli <github@kharan.ch>
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-05-06 00:32:52 +02:00
LocalAI [bot]
a315c321c1 chore: ⬆️ Update TheTom/llama-cpp-turboquant to 69d8e4be47243e83b3d0d71e932bc7aa61c644dc (#9638)
⬆️ Update TheTom/llama-cpp-turboquant

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-06 00:29:05 +02:00
Ettore Di Giacinto
75fba9e03f fix(distributed): scope Upgrade All to nodes that have the backend installed (#9678)
In distributed mode the React UI's "Upgrade All" button fanned every
detected outdated backend out to every healthy backend node, including
nodes that never had that backend installed. On heterogeneous clusters
this surfaced as platform errors (e.g. mac-mini-m4 asked to upgrade
cpu-insightface-development, which has no darwin/arm64 variant) and left
forever-retrying pending_backend_ops rows.

DistributedBackendManager.UpgradeBackend now queries ListBackends()
first, builds the target node-ID set from SystemBackend.Nodes, and only
fans out to those nodes — every per-node primitive
(adapter.InstallBackend, the pending-ops queue, BackendOpResult) is
unchanged. enqueueAndDrainBackendOp gains an optional targetNodeIDs
allowlist; Install/Delete keep their fan-to-everyone semantics by
passing nil. If no node reports the backend installed, UpgradeBackend
now returns a clear "not installed on any node" error instead of
producing a stuck queue.

Adds Ginkgo coverage for the smart fan-out: backend on a subset of
nodes goes only to those nodes; backend on no node returns the new
error and never sends a NATS install request.


Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-06 00:28:41 +02:00
Richard Palethorpe
16b2d4c807 fix(python-backend): make JIT subprocesses work on hosts of any size (#9679)
Two related runtime fixes for Python backends that JIT-compile CUDA
kernels at first model load (FlashInfer, PyTorch inductor, triton):

1. libbackend.sh: replace `source ${EDIR}/venv/bin/activate` with a
   minimal manual setup (_activateVenv: export VIRTUAL_ENV, prepend
   PATH, unset PYTHONHOME) computed from $EDIR at runtime. `uv venv`
   and `python -m venv` both bake the create-time absolute path into
   bin/activate (e.g. VIRTUAL_ENV='/vllm/venv' from the Docker build
   stage), so sourcing activate on a relocated venv — copied out of
   the build container and unpacked at an arbitrary backend dir —
   prepends a stale, non-existent path to $PATH. Pip-installed CLI
   tools (e.g. ninja, used by FlashInfer's NVFP4 GEMM JIT) are then
   never found and the load aborts with FileNotFoundError. Doing the
   env setup ourselves matches what `uv run` does internally and
   sidesteps the relocation problem entirely. Generic — every Python
   backend benefits.

2. vllm/run.sh: replace ninja's default -j$(nproc)+2 with an adaptive
   MAX_JOBS = min(nproc, (MemAvailable-4)/4). Each concurrent
   nvcc/cudafe++ peaks at multiple GiB; the default OOM-kills on
   memory-tight hosts (e.g. a 16 GiB desktop loading a 27B NVFP4
   model) but underutilises 100-core / 1 TB boxes. User-set MAX_JOBS
   still wins. Also pin NVCC_THREADS=2 unless overridden.

Refs: https://github.com/vllm-project/vllm/issues/20079

Assisted-by: Claude:claude-opus-4-7 [Edit] [Bash]
2026-05-06 00:28:01 +02:00
Richard Palethorpe
8e43842175 feat(vllm, distributed): tensor parallel distributed workers (#9612)
* feat(vllm): build vllm from source for Intel XPU

Upstream publishes no XPU wheels for vllm. The Intel profile was
silently picking up a non-XPU wheel that imported but errored at
engine init, and several runtime deps (pillow, charset-normalizer,
chardet) were missing on Intel -- backend.py crashed at import time
before the gRPC server came up.

Switch the Intel profile to upstream's documented from-source
procedure (docs/getting_started/installation/gpu.xpu.inc.md in
vllm-project/vllm):

  - Bump portable Python to 3.12 -- vllm-xpu-kernels ships only a
    cp312 wheel.
  - Source /opt/intel/oneapi/setvars.sh so vllm's CMake build sees
    the dpcpp/sycl compiler from the oneapi-basekit base image.
  - Hide requirements-intel-after.txt during installRequirements
    (it used to 'pip install vllm'); install vllm's deps from a
    fresh git clone of vllm via 'uv pip install -r
    requirements/xpu.txt', swap stock triton for
    triton-xpu==3.7.0, then 'VLLM_TARGET_DEVICE=xpu uv pip install
    --no-deps .'.
  - requirements-intel.txt trimmed to LocalAI's direct deps
    (accelerate / transformers / bitsandbytes); torch-xpu, vllm,
    vllm_xpu_kernels and the rest come from upstream's xpu.txt
    during the source build.
  - requirements.txt: add pillow + charset-normalizer + chardet --
    used by backend.py and missing on the Intel install profile.
  - run.sh: 'set -x' so backend startup is visible in container
    logs (the gRPC startup error path was previously opaque).

Also adds a one-line docs example for engine_args.attention_backend
under the vLLM section, since older XE-HPG GPUs (e.g. Arc A770)
need TRITON_ATTN to bypass the cutlass path in vllm_xpu_kernels.

Tested end-to-end on an Intel Arc A770 with Qwen2.5-0.5B-Instruct
via LocalAI's /v1/chat/completions.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(vllm): add multi-node data-parallel follower worker

vLLM v1's multi-node story is one process per node sharing a DP
coordinator over ZMQ -- the head runs the API server with
data_parallel_size > 1 and followers run `vllm serve --headless ...`
with matching topology. Today LocalAI can already configure DP on the
head via the engine_args YAML map, but there's no way to bring up the
follower nodes -- so the head sits waiting for ranks that never
handshake.

Add `local-ai p2p-worker vllm`, mirroring MLXDistributed's structural
precedent (operator-launched, static config, no NATS placement). The
worker:

  - Optionally self-registers with the frontend as an agent-type node
    tagged `node.role=vllm-follower` so it's visible in the admin UI
    and operators can scope ordinary models away via inverse
    selectors.
  - Resolves the platform-specific vllm backend via the gallery's
    "vllm" meta-entry (cuda*, intel-vllm, rocm-vllm, ...).
  - Runs vLLM as a child process so the heartbeat goroutine survives
    until vLLM exits; forwards SIGINT/SIGTERM so vLLM can clean up its
    ZMQ sockets before we tear down.
  - Validates --headless + --start-rank 0 is rejected (rank 0 is the
    head and must serve the API).

Backend run.sh dispatches `serve` as the first arg to vllm's own CLI
instead of LocalAI's backend.py gRPC server -- the follower speaks
ZMQ directly to the head, there is no LocalAI gRPC on the follower
side. Single-node usage is unchanged.

Generalises the gallery resolution helper into findBackendPath()
shared by MLX and vLLM workers; extracts ParseNodeLabels for the
comma-separated label parsing both use.

Ships with two compose recipes (`docker-compose.vllm-multinode.yaml`
for NVIDIA, `docker-compose.vllm-multinode.intel.yaml` for Intel
XPU/xccl) plus `tests/e2e/vllm-multinode/smoke.sh`. Both vendors are
supported (NCCL for CUDA/ROCm, xccl for XPU) but mixed-vendor DP is
not -- PyTorch's process group requires every rank to use the same
collective backend, and NCCL/xccl/gloo don't interoperate.

Out of scope (deferred): SmartRouter-driven placement of follower
ranks via NATS backend.install events, follower log streaming through
/api/backend-logs, tensor-parallel across nodes, disaggregated
prefill via KVTransferConfig.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(vllm): CPU-only end-to-end test for multi-node DP

Adds tests/e2e/vllm-multinode/, a Ginkgo + testcontainers-go suite
that brings up a head + headless follower from the locally-built
local-ai:tests image, bind-mounts the cpu-vllm backend extracted by
make extract-backend-vllm so it's seen as a system backend (no gallery
fetch, no registry server), and asserts a chat completion across both
DP ranks. New `make test-e2e-vllm-multinode` target wires the docker
build, backend extract, and ginkgo run together; BuildKit caches both
images so re-runs only rebuild what changed. Tagged Label("VLLMMultinode")
so the existing distributed suite isn't pulled along.

Two pre-existing bugs surfaced by the test:

1. extract-backend-% (Makefile) failed for every backend, because all
   backend images end with `FROM scratch` and `docker create` rejects
   an image with no CMD/ENTRYPOINT. Fixed by passing
   --entrypoint=/run.sh -- the container is never started, only
   docker-cp'd, so the path doesn't have to exist; we just need
   anything that satisfies the daemon's create-time validation.

2. backend/python/vllm/run.sh's `serve` shortcut for the multi-node DP
   follower exec'd ${EDIR}/venv/bin/vllm directly, but uv bakes an
   absolute build-time shebang (`#!/vllm/venv/bin/python3`) that no
   longer resolves once the backend is relocated to BackendsPath.
   _makeVenvPortable's shebang rewriter only matches paths that
   already point at ${EDIR}, so the original shebang slips through
   unchanged. Fixed by exec-ing ${EDIR}/venv/bin/python with the script
   as an argument -- Python ignores the script's shebang in that case.

The test fixture caps memory aggressively (max_model_len=512,
VLLM_CPU_KVCACHE_SPACE=1, TORCH_COMPILE_DISABLE=1) so two CPU engines
fit on a 32 GB box. TORCH_COMPILE_DISABLE is currently mandatory for
cpu-vllm: torch._inductor's CPU-ISA probe runs even with
enforce_eager=True and needs g++ on PATH, which the LocalAI runtime
image doesn't ship -- to be addressed in a follow-up that bundles a
toolchain in the cpu-vllm backend.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(vllm): bundle a g++ toolchain in the cpu-vllm backend image

torch._inductor's CPU-ISA probe (`cpu_model_runner.py:65 "Warming up
model for the compilation"`) shells out to `g++` at vllm engine
startup, regardless of `enforce_eager=True` -- the eager flag only
disables CUDA graphs, not inductor's first-batch warmup. The LocalAI
CPU runtime image (Dockerfile, unconditional apt list) does not ship
build-essential, and the cpu-vllm backend image is `FROM scratch`,
so any non-trivial inference on cpu-vllm crashes with:

  torch._inductor.exc.InductorError:
    InvalidCxxCompiler: No working C++ compiler found in
    torch._inductor.config.cpp.cxx: (None, 'g++')

Bundling the toolchain in the CPU runtime image would bloat every
non-vllm-CPU deployment and force a single GCC version on backends
that may want clang or a different version. So this lives in the
backend, gated to BUILD_TYPE=='' (the CPU profile).

`package.sh` snapshots g++ + binutils + cc1plus + libstdc++ + libc6
(runtime + dev) + the math libs cc1plus links (libisl/libmpc/libmpfr/
libjansson) into ${BACKEND}/toolchain/, mirroring /usr/... layout. The
unversioned binaries on Debian/Ubuntu are symlink chains pointing into
multiarch packages (`g++` -> `g++-13` -> `x86_64-linux-gnu-g++-13`,
the latter in `g++-13-x86-64-linux-gnu`), so the package list resolves
both the version and the arch-triplet variant. Symlinks /lib ->
usr/lib and /lib64 -> usr/lib64 are recreated under the toolchain
root because Ubuntu's UsrMerge keeps them at /, and ld scripts
(`libc.so`, `libm.so`) hardcode `/lib/...` paths that --sysroot
re-roots into the toolchain.

The unversioned `g++`/`gcc`/`cpp` symlinks are replaced with wrapper
shell scripts that resolve their own location at runtime and pass
`--sysroot=<toolchain>` and `-B <toolchain>/usr/lib/gcc/<triplet>/<ver>/`
to the underlying versioned binary. That's how torch's bare `g++ foo.cpp
-o foo` invocation finds cc1plus (-B), system headers (--sysroot), and
the bundled libstdc++ (--sysroot, --sysroot is recursive into linker).

`run.sh` adds the toolchain bin dir to PATH and the toolchain's
shared-lib dir to LD_LIBRARY_PATH -- everything else (header search,
linker search, executable search) is encapsulated in the wrappers.
No-op for non-CPU builds, the dir doesn't exist there.

The cpu-vllm image grows by ~217 MB. Tradeoff is acceptable -- cpu-vllm
is already a niche profile (few users compared to GPU vllm) and the
alternative is a backend that crashes at first inference unless the
operator manually sets TORCH_COMPILE_DISABLE=1, which silently disables
all torch.compile optimizations.

Drops `TORCH_COMPILE_DISABLE=1` from tests/e2e/vllm-multinode -- the
smoke now exercises the real compile path through the bundled toolchain.
Test runtime is +20s for the warmup compile, still <90s end to end.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(vllm): scope jetson-ai-lab index to L4T-specific wheels via pyproject.toml

The L4T arm64 build resolves dependencies through pypi.jetson-ai-lab.io,
which hosts the L4T-specific torch / vllm / flash-attn wheels but also
transparently proxies the rest of PyPI through `/+f/<sha>/<filename>`
URLs. With `--extra-index-url` + `--index-strategy=unsafe-best-match`
uv would pick those proxy URLs for ordinary PyPI packages —
anthropic/openai/propcache/annotated-types — and fail when the proxy
503s. Master is hitting the same bug on its own l4t-vllm matrix entry.

Switch the l4t13 install path to a pyproject.toml that marks the
jetson-ai-lab index `explicit = true` and pins only torch, torchvision,
torchaudio, flash-attn, and vllm to it via [tool.uv.sources]. uv won't
consult the L4T mirror for anything else, so transitive deps fall back
to PyPI as the default index — no exposure to the proxy 503s.

`uv pip install -r requirements.txt` ignores [tool.uv.sources], so the
l4t13 branch in install.sh now invokes `uv pip install --requirement
pyproject.toml` directly, replacing the old requirements-l4t13*.txt
files. Other BUILD_PROFILEs continue using libbackend.sh's
installRequirements and never read pyproject.toml.

Local resolution test (x86_64, dry-run) confirms uv hits the L4T
index for torch and falls through to PyPI for everything else.

Assisted-by: claude-code:claude-opus-4-7-1m [Read] [Edit] [Bash] [Write]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-06 00:22:50 +02:00
Arkadiusz Tymiński
503904d311 fix(faster-whisper): cast segment timestamps to int after multiplication (#9674)
`int(x) * 1e9` returns a float because `1e9` is a float literal, but
TranscriptSegment.start/end are integer protobuf fields. This caused
every transcription request to fail with:

  TypeError: 'float' object cannot be interpreted as an integer

Multiply first, then cast — `int(x * 1e9)` — to get an int as required.
2026-05-05 23:46:39 +02:00
LocalAI [bot]
d5ce823b83 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 8b56d813a9ed04fa7b7fe2588fddd845cf64eccb (#9677)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-05 23:46:09 +02:00
LocalAI [bot]
c9141098b6 chore: ⬆️ Update ggml-org/llama.cpp to bbeb89d76c41bc250f16e4a6fefcc9b530d6e3f3 (#9676)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-05 23:45:54 +02:00
dependabot[bot]
1caab1de10 chore(deps): bump actions/checkout from 4 to 6 (#9663)
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-05 15:37:05 +02:00
Ettore Di Giacinto
e86ade54a6 feat(api): add /v1/audio/diarization endpoint with sherpa-onnx + vibevoice.cpp (#9654)
* feat(api): add /v1/audio/diarization endpoint with sherpa-onnx + vibevoice.cpp

Closes #1648.

OpenAI-style multipart endpoint that returns "who spoke when". Single
endpoint instead of the issue's three-endpoint sketch (refactor /vad,
/vad/embedding, /diarization) — the typical client wants one call, and
embeddings can land later as a sibling without breaking this surface.

Response shape borrows from Pyannote/Deepgram: segments carry a
normalised SPEAKER_NN id (zero-padded, stable across the response) plus
the raw backend label, optional per-segment text when the backend bundles
ASR, and a speakers summary in verbose_json. response_format also accepts
rttm so consumers can pipe straight into pyannote.metrics / dscore.

Backends:

* vibevoice-cpp — Diarize() reuses the existing vv_capi_asr pass.
  vibevoice's ASR prompt asks the model to emit
  [{Start,End,Speaker,Content}] natively, so diarization is a by-product
  of the same pass; include_text=true preserves the transcript per
  segment, otherwise we drop it.

* sherpa-onnx — wraps the upstream SherpaOnnxOfflineSpeakerDiarization
  C API (pyannote segmentation + speaker-embedding extractor + fast
  clustering). libsherpa-shim grew config builders, a SetClustering
  wrapper for per-call num_clusters/threshold overrides, and a
  segment_at accessor (purego can't read field arrays out of
  SherpaOnnxOfflineSpeakerDiarizationSegment[] directly).

Plumbing: new Diarize gRPC RPC + DiarizeRequest / DiarizeSegment /
DiarizeResponse messages, threaded through interface.go, base, server,
client, embed. Default Base impl returns unimplemented.

Capability surfaces all updated: FLAG_DIARIZATION usecase,
FeatureAudioDiarization permission (default-on), RouteFeatureRegistry
entries for /v1/audio/diarization and /audio/diarization, audio
instruction-def description widened, CAP_DIARIZATION JS symbol,
swagger regenerated, /api/instructions discovery map updated.

Tests:

* core/backend: speaker-label normalisation (first-seen → SPEAKER_NN,
  per-speaker totals, nil-safety, fallback to backend NumSpeakers when
  no segments).

* core/http/endpoints/openai: RTTM rendering (file-id basename, negative
  duration clamping, fallback id).

* tests/e2e: mock-backend grew a deterministic Diarize that emits
  raw labels "5","2","5" so the e2e suite verifies SPEAKER_NN
  remapping, verbose_json speakers summary + transcript pass-through
  (gated by include_text), RTTM bytes content-type, and rejection of
  unknown response_format. mock-diarize model config registered with
  known_usecases=[FLAG_DIARIZATION] to bypass the backend-name guard.

Docs: new features/audio-diarization.md (request/response, RTTM example,
sherpa-onnx + vibevoice setup), cross-link from audio-to-text.md, entry
in whats-new.md.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* fix(diarization): correct sherpa-onnx symbol name + lint cleanup

CI failures on #9654:

* sherpa-onnx-grpc-{tts,transcription} and sherpa-onnx-realtime panicked
  at backend startup with `undefined symbol: SherpaOnnxDestroyOfflineSpeakerDiarizationResult`.
  Upstream's actual symbol is SherpaOnnxOfflineSpeakerDiarizationDestroyResult
  (Destroy in the middle, not the prefix); the rest of the diarization
  surface follows the same naming pattern. The mismatched name made
  purego.RegisterLibFunc fail at dlopen time and crashed the gRPC server
  before the BeforeAll could probe Health, taking down every sherpa-onnx
  test job — not just the diarization-related ones.

* golangci-lint flagged 5 errcheck violations on new defer cleanups
  (os.RemoveAll / Close / conn.Close); wrap each in a `defer func() { _ = X() }()`
  closure (matches the pattern other LocalAI files use for new code, since
  pre-existing bare defers are grandfathered in via new-from-merge-base).

* golangci-lint also flagged forbidigo violations: the new
  diarization_test.go files used testing.T-style `t.Errorf` / `t.Fatalf`,
  which are forbidden by the project's coding-style policy
  (.agents/coding-style.md). Convert both files to Ginkgo/Gomega
  Describe/It with Expect(...) — they get picked up by the existing
  TestBackend / TestOpenAI suites, no new suite plumbing needed.

* modernize linter: tightened the diarization segment loop to
  `for i := range int(numSegments)` (Go 1.22+ idiom).

Verified locally: golangci-lint with new-from-merge-base=origin/master
reports 0 issues across all touched packages, and the four mocked
diarization e2e specs in tests/e2e/mock_backend_test.go still pass.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* fix(vibevoice-cpp): convert non-WAV input via ffmpeg + raise ASR token budget

Confirmed end-to-end against a real LocalAI instance with vibevoice-asr-q4_k
loaded and the multi-speaker MP3 sample at vibevoice.cpp/samples/2p_argument.mp3:
both /v1/audio/transcriptions and /v1/audio/diarization now succeed and
return correctly attributed speaker turns for the full clip.

Two latent issues surfaced once the diarization endpoint actually exercised
the backend with a non-trivial input:

1. vv_capi_asr only accepts WAV via load_wav_24k_mono. The previous code
   passed the uploaded path straight through, so anything that wasn't
   already a 24 kHz mono s16le WAV failed at the C side with rc=-8 and
   the very unhelpful "vv_capi_asr failed". prepareWavInput shells out
   to ffmpeg ("-ar 24000 -ac 1 -acodec pcm_s16le") in a per-call temp
   dir, matching the rate the model was trained on; both AudioTranscription
   and Diarize now route through it. This is the same shape sherpa-onnx
   uses (utils.AudioToWav), but vibevoice needs 24 kHz rather than 16 kHz
   so we don't reuse that helper.

2. The C ABI's max_new_tokens defaults to 256 when 0 is passed. That's
   fine for a five-second clip but not for anything past ~10 s — vibevoice
   stops mid-JSON, the parse fails, and the caller sees a hard error.
   Pass a much larger budget (16 384 ≈ ~9 minutes of speech at the
   model's ~30 tok/s rate); generation stops at EOS so this is a cap
   rather than a target.

3. As a defensive belt-and-braces, mirror AudioTranscription's existing
   "fall back to a single segment if the model emits non-JSON text"
   pattern in Diarize, so partial / unusual model output never produces
   a 500. This kept the endpoint usable while diagnosing (1) and (2),
   and is the right behaviour to keep.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* fix(vibevoice-cpp): pass valid WAVs through directly so ffmpeg is not required at runtime

Spotted by tests-e2e-backend (1.25.x): the previous fix forced every
incoming audio file through `ffmpeg -ar 24000 ...`, which meant the
backend container — which does not ship ffmpeg — failed even for the
existing happy path where the caller already uploads a WAV. The
container-side error was:

    rpc error: code = Unknown desc = vibevoice-cpp: ffmpeg convert to
    24k mono wav: exec: "ffmpeg": executable file not found in $PATH

Reading vibevoice.cpp's audio_io.cpp, `load_wav_24k_mono` uses drwav and
already accepts any PCM/IEEE-float WAV at any sample rate, downmixes
multi-channel input to mono, and resamples to 24 kHz internally. So the
only inputs that genuinely need an external converter are non-WAV
formats (MP3, OGG, FLAC, ...).

Detect WAVs by RIFF/WAVE magic at bytes 0..3 / 8..11 and pass them
straight through with a no-op cleanup; everything else still goes
through ffmpeg with the same 24 kHz mono s16le target. The result:

* Container builds without ffmpeg keep working for WAV uploads
  (the e2e-backends fixture is jfk.wav at 16 kHz mono s16le).
* MP3 and other non-WAV inputs still get the new ffmpeg conversion
  path so the diarization endpoint stays useful.
* If the caller uploads a non-WAV but ffmpeg isn't on PATH, the
  surfaced error is still descriptive enough to act on.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* fix(ci): make gcc-14 install in Dockerfile.golang best-effort for jammy bases

The LocalVQE PR (bb033b16) made `gcc-14 g++-14` an unconditional apt
install in backend/Dockerfile.golang and pointed update-alternatives at
them. That works on the default `BASE_IMAGE=ubuntu:24.04` (noble has
gcc-14 in main), but every Go backend that builds on
`nvcr.io/nvidia/l4t-jetpack:r36.4.0` — jammy under the hood — now fails
at the apt step:

    E: Unable to locate package gcc-14

This blocked unrelated jobs:
backend-jobs(*-nvidia-l4t-arm64-{stablediffusion-ggml, sam3-cpp, whisper,
acestep-cpp, qwen3-tts-cpp, vibevoice-cpp}). LocalVQE itself is only
matrix-built on ubuntu:24.04 (CPU + Vulkan), so it doesn't actually
need gcc-14 anywhere else.

Make the gcc-14 install conditional on the package being available in
the configured apt repos. On noble: identical behaviour to today (gcc-14
installed, update-alternatives points at it). On jammy: skip the
gcc-14 stanza entirely and let build-essential's default gcc take over,
which is what the other Go backends compile with anyway.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-05 15:10:13 +02:00
LocalAI [bot]
1634eece6b chore: ⬆️ Update ikawrakow/ik_llama.cpp to 45dfd80371785731bc2ed05a76252497a4e7a282 (#9644)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-05 15:09:40 +02:00
LocalAI [bot]
b88ddce0f3 chore: ⬆️ Update ggml-org/llama.cpp to eff06702b2a52e1020ea009ebd86cb9f5acabab5 (#9637)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-05 09:52:28 +02:00
Ettore Di Giacinto
bbcaebc1ef feat(concurrency-groups): per-model exclusive groups for backend loading (#9662)
* feat(concurrency-groups): per-model exclusive groups for backend loading

Adds `concurrency_groups: [...]` to model YAML configs. Two models that share
a group cannot be loaded concurrently on the same node — loading one evicts
the others, reusing the existing pinned/busy/retry policy from LRU eviction.

Layered design:
- Watchdog (pkg/model): per-node correctness floor — on every Load(), evict
  any loaded model that shares a group with the requested one. Pinned skips
  surface NeedMore so the loader retries (and ultimately logs a clear
  warning), instead of silently allowing the rule to be violated.
- Distributed scheduler (core/services/nodes): soft anti-affinity hint —
  scheduleNewModel prefers nodes that don't already host a same-group
  model, falling back to eviction only if every candidate has a conflict.
  Composes with NodeSelector at the same point in the candidate pipeline.

Per-node, not cluster-wide: VRAM is a node-local resource, and two heavy
models running on different nodes is fine. The ConfigLoader is wired into
SmartRouter via a small ConcurrencyConflictResolver interface so the nodes
package keeps a narrow surface on core/config.

Refactors the inner LRU eviction body into a shared collectEvictionsLocked
helper and the loader retry loop into retryEnforce(fn, maxRetries, interval),
so both LRU and group enforcement share busy/pinned/retry semantics.

Closes #9659.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(watchdog): sync pinned + concurrency_groups at startup

The startup-time watchdog setup lives in initializeWatchdog (startup.go),
not in startWatchdog (watchdog.go). The latter is only invoked from the
runtime-settings RestartWatchdog path. As a result, neither
SyncPinnedModelsToWatchdog nor SyncModelGroupsToWatchdog ran at boot,
so `pinned: true` and `concurrency_groups: [...]` only became effective
after a settings-driven watchdog restart.

Fix by adding both sync calls to initializeWatchdog. Confirmed end-to-end:
loading model A in group "heavy", then C with no group (coexists),
then B in group "heavy" now correctly evicts A and leaves [B, C].

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(test): satisfy errcheck on new os.Remove in concurrency_groups spec

CI lint runs new-from-merge-base, so the existing pre-existing
`defer os.Remove(tmp.Name())` lines are baseline-grandfathered but the
one introduced by the concurrency_groups YAML round-trip test is held
to errcheck. Wrap the remove in a closure that discards the error.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-05 08:42:50 +02:00
dependabot[bot]
22ae415695 chore(deps): bump docs/themes/hugo-theme-relearn from f69a085 to 8bb66fa (#9665)
chore(deps): bump docs/themes/hugo-theme-relearn

Bumps [docs/themes/hugo-theme-relearn](https://github.com/McShelby/hugo-theme-relearn) from `f69a085` to `8bb66fa`.
- [Release notes](https://github.com/McShelby/hugo-theme-relearn/releases)
- [Commits](f69a085322...8bb66fa674)

---
updated-dependencies:
- dependency-name: docs/themes/hugo-theme-relearn
  dependency-version: 8bb66fa674351f3a0b0917a7552caac686eca920
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-05 08:42:27 +02:00
LocalAI [bot]
3a0164670e chore: ⬆️ Update vllm-project/vllm cu130 wheel to 0.20.1 (#9649)
⬆️ Update vllm-project/vllm cu130 wheel

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-05 08:41:55 +02:00
LocalAI [bot]
a91b05907c feat(swagger): update swagger (#9660)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-05 01:50:17 +02:00
LocalAI [bot]
4ef45bbccd chore(model-gallery): ⬆️ update checksum (#9661)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-05 00:12:22 +02:00
Beshoy Girgis
b224a3d931 deps: update quic-go to v0.59.0 (fix session ticket panic) (#9655)
Update quic-go from v0.54.1 to v0.59.0 to fix the crypto/tls session
ticket panic described in quic-go/quic-go#5572.

Co-dependency go-libp2p upgraded from v0.43.0 to v0.48.0 (required for
quic-go v0.59.0 compatibility).

Signed-off-by: Beshoy Girgis <shoy@1ds.us>
2026-05-04 22:07:42 +02:00
Richard Palethorpe
bb033b16a9 feat: add LocalVQE backend and audio transformations UI (#9640)
feat(audio-transform): add LocalVQE backend, bidi gRPC RPC, Studio UI

Introduce a generic "audio transform" capability for any audio-in / audio-out
operation (echo cancellation, noise suppression, dereverberation, voice
conversion, etc.) and ship LocalVQE as the first backend implementation.

Backend protocol:
- Two new gRPC RPCs in backend.proto: unary AudioTransform for batch and
  bidirectional AudioTransformStream for low-latency frame-by-frame use.
  This is the first bidi stream in the proto; per-frame unary at LocalVQE's
  16 ms hop would be RTT-bound. Wire it through pkg/grpc/{client,server,
  embed,interface,base} with paired-channel ergonomics.

LocalVQE backend (backend/go/localvqe/):
- Go-Purego wrapper around upstream liblocalvqe.so. CMake builds the upstream
  shared lib + its libggml-cpu-*.so runtime variants directly — no MODULE
  wrapper needed because LocalVQE handles CPU feature selection internally
  via GGML_BACKEND_DL.
- Sets GGML_NTHREADS from opts.Threads (or runtime.NumCPU()-1) — without it
  LocalVQE runs single-threaded at ~1× realtime instead of the documented
  ~9.6×.
- Reference-length policy: zero-pad short refs, truncate long ones (the
  trailing portion can't have leaked into a mic that wasn't recording).
- Ginkgo test suite (9 always-on specs + 2 model-gated).

HTTP layer:
- POST /audio/transformations (alias /audio/transform): multipart batch
  endpoint, accepts audio + optional reference + params[*]=v form fields.
  Persists inputs alongside the output in GeneratedContentDir/audio so the
  React UI history can replay past (audio, reference, output) triples.
- GET /audio/transformations/stream: WebSocket bidi, 16 ms PCM frames
  (interleaved stereo mic+ref in, mono out). JSON session.update envelope
  for config; constants hoisted in core/schema/audio_transform.go.
- ffmpeg-based input normalisation to 16 kHz mono s16 WAV via the existing
  utils.AudioToWav (with passthrough fast-path), so the user can upload any
  format / rate without seeing the model's strict 16 kHz constraint.
- BackendTraceAudioTransform integration so /api/backend-traces and the
  Traces UI light up with audio_snippet base64 and timing.
- Routes registered under routes/localai.go (LocalAI extension; OpenAI has
  no /audio/transformations endpoint), traced via TraceMiddleware.

Auth + capability + importer:
- FLAG_AUDIO_TRANSFORM (model_config.go), FeatureAudioTransform (default-on,
  in APIFeatures), three RouteFeatureRegistry rows.
- localvqe added to knownPrefOnlyBackends with modality "audio-transform".
- Gallery entry localvqe-v1-1.3m (sha256-pinned, hosted on
  huggingface.co/LocalAI-io/LocalVQE).

React UI:
- New /app/transform page surfaced via a dedicated "Enhance" sidebar
  section (sibling of Tools / Biometrics) — the page is enhancement, not
  generation, so it lives outside Studio. Two AudioInput components
  (Upload + Record tabs, drag-drop, mic capture).
- Echo-test button: records mic while playing the loaded reference through
  the speakers — the mic naturally picks up speaker bleed, giving a real
  (mic, ref) pair for AEC testing without leaving the UI.
- Reusable WaveformPlayer (canvas peaks + click-to-seek + audio controls)
  and useAudioPeaks hook (shared module-scoped AudioContext to avoid
  hitting browser context limits with three players on one page); migrated
  TTS, Sound, Traces audio blocks to use it.
- Past runs saved in localStorage via useMediaHistory('audio-transform') —
  the history entry stores all three URLs so clicking re-renders the full
  triple, not just the output.

Build + e2e:
- 11 matrix entries removed from .github/workflows/backend.yml (CUDA, ROCm,
  SYCL, Metal, L4T): upstream supports only CPU + Vulkan, so we ship those
  two and let GPU-class hardware route through Vulkan in the gallery
  capabilities map.
- tests-localvqe-grpc-transform job in test-extra.yml (gated on
  detect-changes.outputs.localvqe).
- New audio_transform capability + 4 specs in tests/e2e-backends.
- Playwright spec suite in core/http/react-ui/e2e/audio-transform.spec.js
  (8 specs covering tabs, file upload, multipart shape, history, errors).

Docs:
- New docs/content/features/audio-transform.md covering the (audio,
  reference) mental model, batch + WebSocket wire formats, LocalVQE param
  keys, and a YAML config example. Cross-links from text-to-audio and
  audio-to-text feature pages.

Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit Write Agent TaskCreate]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-04 22:07:11 +02:00
LocalAI [bot]
de83b72bb7 fix(distributed): orchestrator resilience — auto-upgrade routing, worker bind-wait, RAG-init crash, log spam (#9657)
* fix(nodes/health): skip stale-marking already-offline nodes

The health monitor re-emitted "Node heartbeat stale" + "Marking stale
node offline" + MarkOffline on every cycle for nodes that were already
in the offline (or unhealthy) state. For an operator-stopped node this
flooded the logs with the same WARN+INFO pair every check interval.

Skip the staleness branch when the node is already StatusOffline /
StatusUnhealthy — the state is already what we'd write, so neither the
log lines nor the DB update carry information.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(worker): wait for backend gRPC bind before replying to backend.install

The backend supervisor used to wait up to 4s (20 × 200ms) for the
backend's gRPC server to answer a HealthCheck, then log a warning and
reply Success with the bind address anyway. On slower nodes (a Jetson
Orin doing first-boot CUDA init, large CGO library load) the gRPC
listener wasn't up yet, so the frontend's first LoadModel dial returned
"connect: connection refused" and the operator chased a phantom network
issue instead of a startup-timing one.

Two changes:

  - Bump the readiness window to 30s. CUDA init on Orin/Thor first boot
    measures in seconds, not milliseconds.
  - On deadline-exceeded, stop the half-started process, recycle the
    port, and return an error with the backend's stderr tail. The
    frontend now gets a real failure with diagnostic context instead of
    a misleading ECONNREFUSED on a downstream dial.

Process death during the wait window keeps its existing fast-fail path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(distributed): route auto-upgrade through BackendManager + bump LocalAGI/LocalRecall

Two distributed-mode bugs that surfaced together in the orchestrator
logs:

1. Auto-upgrade always failed with "backend not found".

   UpgradeChecker correctly routed CheckUpgrades through the active
   BackendManager (so the frontend aggregates worker state), but the
   auto-upgrade branch right below called gallery.UpgradeBackend
   directly with the frontend's SystemState. In distributed mode the
   frontend has no backends installed locally, so ListSystemBackends
   returned empty and Get(name) failed for every reported upgrade.
   Auto-upgrade now also goes through BackendManager.UpgradeBackend,
   which fans out to workers via NATS.

2. Embedding-load failure on a remote node crashed the orchestrator.

   When RAG init lazily called NewPersistentPostgresCollection and the
   remote embedding worker was unreachable, LocalRecall called
   os.Exit(1) inside the constructor, killing the orchestrator pod.
   LocalRecall now returns errors instead, LocalAGI surfaces them as a
   nil collection, and the existing RAGProviderFromState path returns
   (nil, nil, false) — the same code path the agent pool already takes
   when no RAG is configured. The orchestrator stays up; chat requests
   degrade to "no RAG available" until the embedding worker recovers.

Bumps:
  github.com/mudler/LocalAGI    → e83bf515d010
  github.com/mudler/localrecall → 6138c1f535ab

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-04 19:09:16 +02:00
LocalAI [bot]
1aeb4d7e73 chore(model gallery): 🤖 add 1 new models via gallery agent (#9653)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-04 15:42:08 +02:00
Ettore Di Giacinto
a271c72931 fix(react-ui/e2e): scope backendTrigger to <main> so it skips LanguageSwitcher
The LanguageSwitcher added in the i18n PR (#9642) lives in the sidebar
and also uses aria-haspopup="listbox" — same attribute the import-form
SearchableSelect uses. The Batch D / E tests' helper resolved the trigger
with `page.locator('button[aria-haspopup="listbox"]').first()`, which now
returns the language switcher (rendered first in DOM order, in the
sidebar) instead of the backend dropdown.

After clicking the wrong button, getByRole('option', { name: 'llama-cpp' })
naturally never resolves — language options aren't backend names — and
the test times out at 30s.

Scope the locator to the <main className="main-content"> wrapper so only
buttons inside the route's main content area match. The page layout has
the Sidebar outside <main>, so this cleanly excludes it.

Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-04 08:58:25 +00:00
Ettore Di Giacinto
ade5fd4b97 fix(react-ui): reflect disabled state on SearchableSelect button
The Backend dropdown is disabled while /backends/known is in flight
(disabled={isSubmitting || backendsLoading} in ImportModel.jsx). Until
now the disabled prop only guarded the internal onClick handler — there
was no `disabled` HTML attribute on the <button>, so the element
remained "actionable" from the outside.

That regressed the import-form-ux Batch D / E Playwright tests after
the i18next-suspense PR (#9642): suspending on the importModel
namespace defers the useEffect that fetches /backends/known, so when
the test calls backendTrigger.click() the button is rendered but
backendsLoading is still true. The click hits the no-op branch,
the dropdown stays closed, and `getByRole('option', { name: 'llama-cpp' })`
times out at 30s.

Surfacing the disabled state on the actual <button> makes Playwright
auto-wait until the dropdown is ready, fixes a11y (screen readers now
announce "disabled"), and removes the button from the tab order while
loading.

Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-04 08:21:03 +00:00
LocalAI [bot]
170d55c67d fix(distributed): honor NodeSelector in cached-replica lookup, stop empty-backend reconciler scaleups (#9652)
* fix(distributed): honor NodeSelector in cached-replica lookup, stop empty-backend reconciler scaleups

Two distinct bugs were causing tight retry loops in the distributed scheduler:

1. FindAndLockNodeWithModel ignored the model's NodeSelector. When a model
   was loaded on multiple nodes and only some matched the current selector,
   the function returned the lowest-in_flight node — even one the selector
   excluded. Route()'s post-check then fell through to scheduleNewModel,
   which targeted the matching node where the model was already at
   MaxReplicasPerModel capacity. Eviction couldn't help (the only loaded
   model on that node was the one being requested, and it was busy), so
   every request looped through "evicting LRU" → "all models busy".

   Fix: thread an optional candidateNodeIDs filter through
   FindAndLockNodeWithModel. Route() resolves the selector once via a new
   resolveSelectorCandidates helper and passes the matching IDs to both
   the cached-replica lookup and scheduleNewModel. The same helper
   replaces the inline selector block in scheduleNewModel.

2. ScheduleAndLoadModel (reconciler scale-up path) fell back to
   scheduleNewModel with backendType="" when no replica had ever been
   loaded for a model. The worker rejected the resulting backend.install
   ("backend name is empty") on every reconciler tick (~30s).

   Fix: remove the broken fallback. When GetModelLoadInfo has nothing
   stored, return a clear error instead of firing a doomed NATS install.
   The reconciler's existing scale-up failure log surfaces it once per
   tick; the model auto-replicates as soon as Route() serves it once and
   stores load info.

Also downgrade the post-LoadModel-failure StopGRPC error to Debug — that
cleanup attempt usually hits "model not found" because LoadModel failed
before registering the process, and the outer "Failed to load model"
error already carries the real reason.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]

* test(distributed): cover selector-aware FindAndLockNodeWithModel and reconciler scaleup guard

Two regression tests for the bugs fixed in the previous commit:

1. FindAndLockNodeWithModel — registry-level integration tests verify the
   candidateNodeIDs filter:
   - Returns the included node even when an excluded node has lower
     in_flight (the original selector-mismatch loop scenario).
   - Returns not-found when the model is loaded only on excluded nodes,
     forcing Route() to fall through to a fresh schedule instead of
     reusing the excluded replica.

2. ScheduleAndLoadModel — mock-based test verifies the reconciler scale-up
   path returns an error and does NOT fire backend.install when no replica
   has been loaded yet. fakeUnloader gains an installCalls slice so this
   negative assertion is direct.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-04 09:42:14 +02:00
Ettore Di Giacinto
28b4857bd6 fix(ci): leave ports.ubuntu.com upstream on self-hosted runners
mirrors.edge.kernel.org carries /ubuntu/ (amd64 archive) but does NOT
carry /ubuntu-ports/. With the previous default both archive and ports
pointed at kernel.org, so multi-arch builds (linux/amd64,linux/arm64)
on bigger-runner / arc-runner-set 404'd on the arm64 leg:

  Err:5 http://mirrors.edge.kernel.org/ubuntu-ports noble Release
    404  Not Found [IP: 213.196.21.55 80]

The original outage was on archive.ubuntu.com, not ports.ubuntu.com, so
default the self-hosted-ports-mirror to '' (= keep ports.ubuntu.com
upstream). apt-mirror.sh and the runner-side rewrite both already
no-op when the env var is empty.

Self-hosted amd64 still uses kernel.org for the main archive, which
worked fine in this run before the arm64 leg failed.

Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-04 07:28:43 +00:00
Ettore Di Giacinto
5503be1fb3 fix(ci): use http for the kernel.org mirror — bare ubuntu image has no CA bundle
The Docker build runs on the minimal ubuntu:24.04 base image, which
ships *without* ca-certificates. The very first apt-get update over
HTTPS therefore fails the TLS handshake ("No system certificates
available. Try installing ca-certificates."), and apt can't reach
ca-certificates itself to fix the situation — chicken and egg.

Apt validates package integrity via GPG-signed Release files, so plain
HTTP is safe for the archive. archive.ubuntu.com / azure.archive are
already accessed over HTTP for the same reason. Switch the kernel.org
defaults from https://mirrors.edge.kernel.org to
http://mirrors.edge.kernel.org so the in-Dockerfile rewrite works on
self-hosted runners too.

Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-03 23:29:53 +00:00
Ettore Di Giacinto
50580a84ae fix(ci): switch apt mirror per runner — azure on github-hosted, kernel.org on self-hosted
Self-hosted runners (arc-runner-set, bigger-runner) cannot reach
azure.archive.ubuntu.com — they live in different networks (e.g. our
arc-runner-set Kubernetes cluster) where Azure's mirror IP is not
routable. Symptom: "Connection failed [IP: 51.11.236.225 80]" with each
Ign:/Err: cycle taking 60s, hanging the build for ~16 minutes before
exit 100.

Pick the mirror based on `runner.environment`:

  * github-hosted (ubuntu-latest, ubuntu-24.04-arm) → Azure
    (http://azure.archive.ubuntu.com / http://azure.ports.ubuntu.com)
    — same VPC as the runner.
  * self-hosted (arc-runner-set, bigger-runner)    → kernel.org
    (https://mirrors.edge.kernel.org for both archive and ports)
    — publicly reachable from any network.

The choice now lives in one place: the .github/actions/configure-apt-mirror
composite action exposes `effective-mirror` / `effective-ports-mirror`
outputs so the reusable workflows can forward the same value as Docker
build-args without duplicating the per-runner-environment branch.

The now-redundant `apt-mirror` / `apt-ports-mirror` workflow inputs on
image_build.yml and backend_build.yml are dropped — defaults live in the
composite action and are visible there.

Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-03 22:59:26 +00:00
Ettore Di Giacinto
8edac61e57 feat(ci): allow routing apt traffic through an alternate Ubuntu mirror (#9650)
* feat(ci): allow routing apt traffic through an alternate Ubuntu mirror

Adds opt-in APT_MIRROR / APT_PORTS_MIRROR knobs to all Dockerfiles, the
Makefile, and CI workflows so we can fail over to a non-canonical Ubuntu
mirror when archive.ubuntu.com / security.ubuntu.com / ports.ubuntu.com
are degraded (recently observed: multi-day DDoS against the default pool).

Defaults are empty everywhere — behavior is unchanged unless a mirror is
configured. To enable in CI, set the repo-level GitHub Actions variables
APT_MIRROR (and APT_PORTS_MIRROR for arm64 builds). Locally:
    make docker APT_MIRROR=http://azure.archive.ubuntu.com

A small POSIX-sh helper in .docker/apt-mirror.sh rewrites both DEB822
(/etc/apt/sources.list.d/ubuntu.sources, Ubuntu 24.04+) and the legacy
/etc/apt/sources.list before the first apt-get update. Dockerfile stages
load it via RUN --mount=type=bind, so there is no extra layer and no
cache invalidation when the script is unchanged. Reusable workflows also
rewrite the runner's own /etc/apt sources before any sudo apt-get call.

Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(apt-mirror): default to the Azure mirror, visible in the workflow source

Bakes Azure (http://azure.archive.ubuntu.com / http://azure.ports.ubuntu.com)
in as the default for both Docker builds and runner-side apt — rather than
hiding the URL behind a GitHub Actions repo variable that's not visible
from the source tree.

A new composite action at .github/actions/configure-apt-mirror is the
single source of truth for runner-side rewrites. Five standalone
workflows (build-test, release, tests-e2e, tests-ui-e2e, update_swagger)
just `uses: ./.github/actions/configure-apt-mirror`.

Three workflows (image_build, backend_build, checksum_checker) keep an
inline bash rewrite, because they install/upgrade git via apt *before*
the checkout step (so the local composite action isn't loadable yet).
The Azure URL is visible in those files too.

The `apt-mirror` / `apt-ports-mirror` inputs of the reusable workflows
keep their now-Azure defaults — they still feed the Docker build-args
block in addition to the inline runner-side rewrite. Callers (image.yml,
image-pr.yml, backend.yml, backend_pr.yml) drop the previous
`vars.APT_MIRROR` plumbing and rely on those defaults.

Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(apt-mirror): drop Force Install GIT, consolidate on the composite action

The PPA git upgrade ran add-apt-repository ppa:git-core/ppa, which talks
to api.launchpad.net — also part of Canonical's infrastructure and
currently returning HTTP 504. The Azure mirror only covers
archive.ubuntu.com / security.ubuntu.com / ports.ubuntu.com, not PPAs.

The system git that ubuntu-latest already ships is sufficient for
actions/checkout and the build pipeline, so just drop the upgrade. With
that gone, the apt-before-checkout constraint disappears too — all three
holdouts (image_build, backend_build, checksum_checker) can now switch
to ./.github/actions/configure-apt-mirror like the other five.

Net: 0 inline apt-mirror blocks, all 8 workflows route through the
composite action.

Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-03 23:50:13 +02:00
Tai An
0b024f0886 chore(model gallery): add chroma1-hd diffusers model (#9646)
Resolves https://github.com/mudler/LocalAI/issues/9604

Adds Chroma1-HD (lodestones/Chroma1-HD), an 8.9B-parameter
text-to-image model derived from FLUX.1-schnell, served via the
upstream-diffusers ChromaPipeline. Inference defaults follow the
model card recommendations: 40 steps, CFG 3.0, bfloat16.

Assisted-by: claude-code:opus-4.7
2026-05-03 09:06:31 +02:00
Ettore Di Giacinto
a6121e240e docs: credit the LocalAI maintainers team
Update README and docs to attribute maintenance to the LocalAI team
(Ettore Di Giacinto and Richard Palethorpe) and drop the autonomous
AI dev team section.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7 [Edit] [Bash]
2026-05-02 23:37:04 +00:00
Ettore Di Giacinto
87cf736068 feat(react-ui): add multilingual (i18n) support (#9642)
Adds end-to-end internationalization to the React UI with five seed
languages (English, Italian, Spanish, German, Simplified Chinese) and
a sidebar-footer language switcher next to the existing theme toggle.

Library: react-i18next + i18next + i18next-http-backend +
i18next-browser-languagedetector. The detector caches the user's
choice in localStorage (key `localai-language`, mirroring the existing
`localai-theme` convention) and updates the `<html lang>` attribute on
change. fallbackLng is `en`, so any missing translation in another
locale falls back transparently.

Translation files live under `public/locales/<lng>/<ns>.json`. They
ride along with the existing `//go:embed react-ui/dist/*` directive,
but the previous SPA route in core/http/app.go only exposed
`/assets/*` from the embedded React build. This commit generalizes
the asset handler into a `serveReactSubdir(subdir)` helper and adds a
matching `/locales/*` route so i18next-http-backend can fetch the
JSONs at runtime. The http-backend `loadPath` is built via the
existing `apiUrl()` helper so instances served under a sub-path (e.g.
`<base href="/ui/">`) resolve correctly.

Namespaces (13): common, nav, errors, auth, home, models, importModel,
chat, agents, skills, collections, media, admin. Translated UI surfaces
include the sidebar/header/footer chrome, login + account flows, the
Home dashboard (incl. the manage-by-chat assistant CTA), the model
gallery + import flow, the chat experience (Chat.jsx + ChatsMenu),
agents/skills/collections list pages, the studio media tabs (Image,
Video, TTS), and the admin page-headers (Settings incl. its section
nav, Manage, Backends, Traces, Nodes, P2P, Users, Usage). Shared
components (ConfirmDialog, Toast) take their default labels from the
common namespace so callers don't need to pass strings explicitly.

Tooling for incremental adoption is included:
  - `i18next-parser.config.js` + `npm run i18n:extract` to sweep `t()`
    keys into the JSON skeletons.
  - `scripts/translate-locales.mjs` (one-off helper) to bootstrap
    non-English locales from English source via OpenAI or Anthropic
    APIs, with --copy mode as a placeholder fallback. Idempotent;
    preserves existing translations unless --overwrite is passed.

Larger config-driven pages (ModelEditor, Settings deep field forms,
AgentChat/AgentCreate, SkillEdit, CollectionDetails, Talk, Sound,
biometrics, FineTune/Quantize, Users modals, Nodes/P2P install
pickers, BackendLogs, Traces deep filters, Explorer) intentionally
keep their inner content untranslated for now — they fall back to
English via fallbackLng so functionality is unaffected, and the
extracted-strings pattern + the bootstrap script make follow-up
extraction straightforward.

The initial Suspense fallback at the root in main.jsx covers the
first JSON fetch on cold load. A simple `.app-boot-spinner` styled
in App.css provides a non-empty paint while the first namespace
loads.

Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit Write Agent]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-02 22:42:08 +02:00
LocalAI [bot]
1ad5b5907d feat(swagger): update swagger (#9643)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-02 22:33:47 +02:00
Russell Sim
18e039f305 fix(ci): fix AMDGPU_TARGETS empty-string bypass in hipblas builds (#9626)
* fix(ci): fix AMDGPU_TARGETS empty-string bypass in hipblas builds

399c1dec wired amdgpu-targets through the backend_build workflow_call
interface, intending the input's default value to cover matrix entries
that don't specify targets. However, GitHub Actions only applies a
workflow_call input default when the caller omits the input entirely.
When backend.yml passes `amdgpu-targets: ${{ matrix.amdgpu-targets }}`
and the matrix entry has no amdgpu-targets key, the expression evaluates
to an empty string, which is treated as an explicit value — bypassing
the default. The result is Docker receiving AMDGPU_TARGETS="" which in
turn causes Make's ?= default to be skipped (since the variable is
already set in the environment, even to empty), and cmake gets
-DAMDGPU_TARGETS= with no targets, so the HIP backend compiles for an
indeterminate target rather than the intended GPU list.

Fix this at two levels:

1. backend.yml: use a || fallback in the expression so that an undefined
   matrix.amdgpu-targets never reaches the reusable workflow as an empty
   string. The target list is the canonical default and lives here.

2. backend_build.yml: remove the now-misleading default value from the
   input declaration. The default never fired due to the above bug, so
   keeping it implied a guarantee that didn't exist.

3. backend/cpp/llama-cpp/Makefile: add an explicit $(error ...) guard
   after the ?= assignment so that if AMDGPU_TARGETS is empty (whether
   from environment or any future CI wiring mistake) the build fails
   immediately with a clear message rather than silently producing a
   binary compiled for an unknown GPU target.

Assisted-by: Claude Code:claude-sonnet-4-6
Signed-off-by: Russell Sim <rsl@simopolis.xyz>

* fix(build): plumb AMDGPU_TARGETS through to Docker builds

The docker-build-backend Makefile macro and Dockerfile.golang did not
pass AMDGPU_TARGETS to the inner make invocation, so hipblas builds
always used the backend Makefile's hardcoded default GPU targets
regardless of what was specified via environment or CI inputs.

Signed-off-by: Russell Sim <rsl@simopolis.xyz>

---------

Signed-off-by: Russell Sim <rsl@simopolis.xyz>
2026-05-02 15:53:14 +02:00
Ettore Di Giacinto
b1a99436c7 feat(branding): admin-configurable instance name, tagline, and assets (#9635)
Adds a whitelabeling feature so an operator can replace the LocalAI
instance name, tagline, square logo, horizontal logo, and favicon from
the admin Settings page. Defaults fall back to the bundled assets so
existing installs are unaffected.

The public GET /api/branding endpoint is reachable pre-auth so the
login screen can render the configured branding before sign-in.
Mutating routes (POST/DELETE /api/branding/asset/:kind) remain
admin-only. Text fields (instance_name, instance_tagline) ride the
existing /api/settings flow; binary assets get a dedicated multipart
upload route that persists files under DynamicConfigsDir/branding/.

To prevent the Settings page's stale local state from clobbering an
upload on save, UpdateSettingsEndpoint preserves whatever the on-disk
asset filename fields are regardless of the body — /api/branding/asset/*
are the sole writers for those fields.

The MCP catalog gains get_branding and set_branding tools (text fields
only; file upload stays UI-only) plus a configure_branding skill prompt.

While wiring this up, the same restart-loss class of bug surfaced for
several existing fields whose RuntimeSettings entries were never read
by the startup loader. Fix loadRuntimeSettingsFromFile() to load:

  - branding (instance_name, instance_tagline, *_file basenames)
  - auto_upgrade_backends, prefer_development_backends
  - localai_assistant_enabled
  - open_responses_store_ttl
  - the 7 existing AgentPool fields (enabled, default/embedding model,
    chunking sizes, enable_logs, collection_db_path)

Also exposes 3 new AgentPool runtime settings (vector_engine,
database_url, agent_hub_url) via /api/settings + the Settings UI, with
the same load-on-startup wiring. The file watcher's manual-edit path
is intentionally not changed — the in-process API endpoints already
update appConfig directly, so the watcher is redundant for supported
flows and a separate refactor for everything else.

15 TDD specs cover the loader behaviour (1 branding + 11 adjacent + 3
new agent-pool); 2 specs cover the persistence helpers and the
clobber-prevention contract.


Assisted-by: claude-code:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-02 15:51:36 +02:00
Ettore Di Giacinto
7325046650 fix(diffusers): drop compel from requirements to unblock pip resolver (#9632)
compel 2.3.1 (latest, Nov 2025) declares transformers~=4.25 in its
metadata, i.e. >=4.25,<5.0. After transformers 5.0 (2026-01-26) and
huggingface-hub 1.0 (2025-10-27) shipped, the weekly DEPS_REFRESH
cache rotation in CI started seeing the new majors and pip's resolver
went into multi-hour backtracking storms walking every transformers
4.x candidate against every accelerate/hf-hub/tokenizers combination
to find a set compel would accept. The 2026-04-29 backend-build for
the diffusers backend (darwin-mps + l4t + cublas13-turboquant matrix
cells) hit the GitHub Actions 6h job timeout still inside pip
install — the build itself never started.

compel is the only hard upper bound on transformers in this stack
(diffusers, accelerate, peft, optimum-quanto are all flexible), and
upstream support for transformers 5 is still in flight: damian0815/
compel#129 ("Modernize Compel for Transformers 5") and #128 ("Bump
transformers version to >5.0") are both open as of today.

backend.py only constructs Compel() when COMPEL=1 is set in the env
(default off), so make compel a true optional extra:

  - Wrap the top-level `from compel import ...` in try/except
    ImportError, mirroring the existing sd_embed pattern.
  - Auto-disable COMPEL with a warning when the module isn't
    installed, instead of crashing on module load.
  - Drop compel from all eight requirements-*.txt variants so the
    resolver no longer has to satisfy its transformers cap.
  - Leave a TODO in backend.py and in each requirements file
    pointing at the upstream PR/issue, so the dependency can be
    reinstated once compel supports transformers >= 5.

Users who rely on weighted-prompt embeddings can opt in with a
manual `pip install compel` alongside COMPEL=1; the warning emitted
on startup tells them how.

Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit WebFetch]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-01 14:45:14 +02:00
Ettore Di Giacinto
8452068f43 feat(importers): whisper.cpp HF repos pick a quant + nest under whisper/models (#9630)
The WhisperImporter's Import() switch ordered LooksLikeURL ahead of the
HuggingFace branch, so any https://huggingface.co/<owner>/<repo> URI
(e.g. LocalAI-io/whisper-large-v3-it-yodas-only-ggml) hijacked the URL
path. FilenameFromUrl returned the repo slug, the gallery entry pointed
at the HTML repo page, the SHA256 was empty, and the HF file listing
was effectively dead code for HTTPS imports. The HF branch only fired
for huggingface://owner/repo and hf://owner/repo references.

Gate the URL case on a "ggml-*.bin" basename signal — mirroring how
the llama-cpp importer gates on ".gguf" — so direct file URLs still
take the URL path while HF repo URLs fall through to the HF branch.
There the file listing is actually consulted: every ggml-*.bin entry
is collected and one is picked by the new preferences.quantizations
preference (default q5_0; comma-separated for fallback ordering).

Pin the chosen file under whisper/models/<name>/<file> so a single
repo can ship q4_0/q5_0/q8_0 side-by-side without colliding on disk,
matching the llama-cpp/models/<name>/ layout. The fallback when no
preference matches is the last available ggml file, mirroring
llama-cpp's pickPreferredGroup behaviour.

Tests: replace the previous probe spec with positive assertions
against LocalAI-io/whisper-large-v3-it-yodas-only-ggml (default →
ggml-model-q5_0.bin, quantizations=q4_0 → ggml-model-q4_0.bin) plus
two offline specs that build a fake hfapi.ModelDetails to cover the
fallback rule and non-ggml filtering without touching the network.


Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit WebFetch]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-01 12:03:07 +02:00
ER-EPR
0b0078047f Add tags to qwen3-vl-reranker and Qwen3-VL-Embedding to the gallery (#9628)
* Add tags to Qwen3-VL-Reranker models

Added tags for reranker models in index.yaml.

Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com>

* Add Qwen3-VL-Embedding models to gallery

Added Qwen3-VL-Embedding-8B and Qwen3-VL-Embedding-2B models with detailed descriptions and file references.

Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com>

* Update index.yaml

Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com>

---------

Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com>
2026-05-01 10:56:58 +02:00
Tai An
80961d2da6 feat(backends/python): use tempfile.gettempdir() instead of hardcoded /tmp (#9629)
Closes #9601

Makes the temporary scratch paths in vllm, vllm-omni, tinygrad, and pocket-tts
backends configurable via the standard TMPDIR env var, instead of always writing
to /tmp. This is a one-line change per call site that calls tempfile.gettempdir()
for the directory and keeps the same filename suffix.

Users who run on systems with a small root partition (or want to relocate scratch
files to a larger volume) can now redirect these by setting TMPDIR
(e.g. TMPDIR=/data/tmp), without affecting the existing LOCALAI_GENERATED_CONTENT_PATH
or LOCALAI_UPLOAD_PATH options that already cover other temp paths.

Files touched:
- backend/python/vllm/backend.py        (1 site: video base64 scratch)
- backend/python/tinygrad/backend.py    (1 site: image fallback dst)
- backend/python/pocket-tts/backend.py  (1 site: tts wav fallback dst)
- backend/python/vllm-omni/backend.py   (2 sites: video + audio scratch)
2026-05-01 10:56:24 +02:00
LocalAI [bot]
9c4c3f9d8f chore: ⬆️ Update ggml-org/llama.cpp to beb42fffa45eded44804a1fd4916146222371581 (#9624)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-01 02:02:56 +02:00
LocalAI [bot]
273416f54b chore: ⬆️ Update ikawrakow/ik_llama.cpp to a8aecbf15933295af96504f9a693998322185b5c (#9625)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-01 02:02:29 +02:00
Ettore Di Giacinto
c02a50f2ab feat(llama-cpp): bump to d775992 and adapt to spec params refactor (#9618)
Bumps backend/cpp/llama-cpp/Makefile LLAMA_VERSION from 665abc6 to
d775992, picking up upstream PR ggml-org/llama.cpp#22397 which splits
common_params_speculative into nested draft / ngram_simple / ngram_mod
sub-structs. Renames every grpc-server.cpp reference to match:

  speculative.mparams_dft.path  -> speculative.draft.mparams.path
  speculative.{n_max,n_min}     -> speculative.draft.{n_max,n_min}
  speculative.{p_min,p_split}   -> speculative.draft.{p_min,p_split}
  speculative.{n_gpu_layers,n_ctx} -> speculative.draft.{n_gpu_layers,n_ctx}
  speculative.ngram_size_n      -> speculative.ngram_simple.size_n
  speculative.ngram_size_m      -> speculative.ngram_simple.size_m
  speculative.ngram_min_hits    -> speculative.ngram_simple.min_hits

The "speculative.n_max" JSON key sent to the upstream server stays
unchanged — server-task.cpp still reads it and routes the value into
draft.n_max internally.

The turboquant fork (TheTom/llama-cpp-turboquant @ 11a241d) branched
before #22397 and still exposes the flat layout. Since turboquant
reuses the shared backend/cpp/llama-cpp/grpc-server.cpp, extend
patch-grpc-server.sh with an idempotent sed block that reverts the
ten field references back to the legacy flat names on the build copy
only — the original under backend/cpp/llama-cpp/ stays compiling
against vanilla upstream. Drop the block once the fork rebases.

ik-llama-cpp has its own grpc-server.cpp with no speculative refs
(0/2661 lines), so it is unaffected.

Validated locally with `make docker-build-llama-cpp` (avx, avx2,
avx512, fallback, grpc + rpc-server all built; image exported).


Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-30 08:44:43 +02:00
LocalAI [bot]
76971fb2aa chore: ⬆️ Update leejet/stable-diffusion.cpp to 3d6064b37ef4607917f8acf2ca8c8906d5087413 (#9617)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-30 08:43:42 +02:00
LocalAI [bot]
ebd9fcbe20 chore(model gallery): 🤖 add 1 new models via gallery agent (#9615)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-29 22:33:41 +02:00
Ettore Di Giacinto
091eda8d70 feat: react chat redesign (#9616)
* feat(react-ui): redesign chat — popover history, focus on send, density pass

Replace the persistent 260px conversation sidebar with a Cmd/Ctrl+K
popover (ChatsMenu) so the conversation owns the page. Once a chat has
at least one message we auto-collapse the global app rail and fade
non-essential header chrome; Esc gives the user back the full chrome
for the rest of the session.

Move Canvas mode and the MCP dropdown into the input wrapper as mode
chips — they describe what's armed for the next message and now live
where the user composes. The chat header drops to Chats · title ·
ModelSelector · overflow · settings, and an overflow menu carries
admin-only Manage mode along with Info / Edit / Export / Clear.

Density pass: tighter header (40px), smaller avatars with the assistant
left-border accent doing the work, 88% bubble width, modern
field-sizing on the textarea, 32px send/stop buttons.

Empty state now surfaces a Recent strip (top 4 non-empty chats) and a
Cmd+K hint, replacing the discoverability the persistent sidebar used
to provide.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

* feat(react-ui): chat input chips, slimmer menu, focus mode polish

Move Canvas mode and the MCP dropdown into the input wrapper as compact
mode chips — they describe what's armed for the next message and now
sit where the user composes. The MCP popover flips upward when anchored
to the input row so it stays on-screen.

Eliminate the chat header overflow ("…") menu entirely; relocate each
item to its semantic home so users don't have to remember a
miscellany drawer:

- Manage mode toggle → top of the Settings drawer, alongside the
  other sticky chat knobs. The shield next to the title still
  signals state at a glance.
- Model info / Edit config → small admin-only "ⓘ" button next to the
  ModelSelector; the existing model-info panel now hosts the Edit
  config link.
- Export as Markdown → per-row hover action in ChatsMenu, so it works
  for any chat (not just the active one).
- Clear chat history → destructive button at the bottom of the
  Settings drawer.

Make the Sidebar listen to its own `sidebar-collapse` event so the
chat's focus mode actually shrinks the rail (it previously only
flipped the layout class, leaving the sidebar element at full width
and overlapping the chat). Drop the focus-mode toast — the visual
shift is enough; the toast was noise.

Define `--color-text-tertiary` in both themes; without it metadata
text (recent strip timestamps and a few other sites) was inheriting
the platform default, which read as black on the dark surface.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

* fix(model/log-store): close merged channel exactly once; clean up Remove

Two latent races in BackendLogStore.Subscribe could panic under load
(distributed e2e test triggered "send on closed channel" at
backend_log_store.go:288):

1. The aggregated path closed the merged channel `ch` from two
   places — the fan-in waiter goroutine (after all source channels
   drained) and unsubscribe(). When unsubscribe ran while a fan-in
   goroutine was mid-flight on `ch <- line`, the close beat the send
   and the runtime panicked. Now `ch` is closed by exactly one
   goroutine: the waiter that observes all fan-in goroutines finish.
   unsubscribe() only closes the per-buffer source channels — the
   for-range in each fan-in goroutine then exits naturally and the
   waiter takes care of the merged close.

2. Remove() closed every subscriber channel but didn't delete the
   entries from the subscribers map, so a concurrent unsubscribe()
   would call close() again on the already-closed channel
   ("close of closed channel"). Clear the map entry while closing.

Add a regression test that hammers AppendLine concurrently with
Subscribe + unsubscribe + Remove; the race detector catches both
classes of regression.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

* test(model/log-store): port backend log store tests to ginkgo

Bring backend_log_store_test.go in line with the rest of pkg/model
(loader_test, watchdog_test, store_test): same external test package
(`model_test`), same ginkgo + gomega imports, same Describe/It
nesting around the public API. Behaviour is unchanged — the four
existing scenarios plus the unsubscribe race regression all run as
specs under the existing `TestModel` suite.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-29 22:33:26 +02:00
Ettore Di Giacinto
fe6eb57082 feat(vibevoice-cpp): add purego TTS+ASR backend (#9610)
* feat(vibevoice-cpp): add purego TTS+ASR backend

Wire up Microsoft VibeVoice via the vibevoice.cpp C ABI as a new
purego-based Go backend that serves both Backend.TTS and
Backend.AudioTranscription from a single gRPC binary. Mirrors the
qwen3-tts-cpp / sherpa-onnx pattern so the variant matrix
(cpu/cuda12/cuda13/metal/rocm/sycl-f16/f32/vulkan/l4t) and the
e2e-backends gRPC harness reuse existing infrastructure.

- backend/go/vibevoice-cpp/ - Makefile, CMakeLists, purego shim, gRPC
  Backend with model-dir auto-detection, closed-loop TTS->ASR smoke test
- backend/index.yaml - &vibevoicecpp meta + 18 image entries
- Makefile - .NOTPARALLEL, BACKEND_VIBEVOICE_CPP, docker-build wiring,
  test-extra-backend-vibevoice-cpp-{tts,transcription} e2e wrappers
- .github/workflows/backend.yml - matrix entries for all variants
- .github/workflows/test-extra.yml - per-backend smoke + 2 gRPC e2e jobs

* feat(vibevoice-cpp): drop hardcoded glob detection, add gallery entries

Refactor backend Load() to follow the standard Options[] convention
used by sherpa-onnx and the rest of the multi-role backends:
ModelFile is the primary gguf, supplementary paths come through
opts.Options[] as key=value (or key:value for Make-target compat),
resolved against opts.ModelPath. type=asr/tts decides the role of
ModelFile when neither tts_model nor asr_model is set explicitly.

Add gallery/index.yaml entries:
- vibevoice-cpp     - realtime 0.5B Q8_0 TTS + tokenizer + Carter voice
- vibevoice-cpp-asr - long-form ASR Q8_0 + tokenizer

Both pull from huggingface://mudler/vibevoice.cpp-models with sha256
verification. parameters.model + Options[] paths are siblings under
{models_dir} per the qwen3-tts-cpp convention.

Update Makefile e2e wrappers to pass BACKEND_TEST_OPTIONS comma+colon
style, and tighten the per-backend Go closed-loop test to use the
explicit Options API.

* fix(vibevoice-cpp): force whole-archive link so vv_capi_* exports survive

libvibevoice is a STATIC archive linked into the MODULE library.
Without --whole-archive (or -force_load on Apple, /WHOLEARCHIVE on
MSVC), the linker garbage-collects symbols not referenced from this
translation unit - which means dlopen+RegisterLibFunc panics with
'undefined symbol: vv_capi_load' at backend startup, since purego
looks them up by name and our cpp/govibevoicecpp.cpp doesn't call
them directly.

* test(vibevoice-cpp): rewrite suite with Ginkgo v2

Match the convention used by backend/go/sherpa-onnx/backend_test.go.
The suite now covers backend semantics that don't need purego (Locking,
empty-ModelFile rejection, TTS/ASR-without-loaded-model errors) on top
of the gRPC lifecycle specs (Health, Load, closed-loop TTS->ASR).
Model-dependent specs Skip() when VIBEVOICE_MODEL_DIR is unset, so
`go test ./backend/go/vibevoice-cpp/` is green on a clean checkout
and runs the heavyweight closed-loop spec when test.sh has staged
the bundle.

* fix(vibevoice-cpp): implement TTSStream + AudioTranscriptionStream

The gRPC server's stream handlers (pkg/grpc/server.go) spawn a
goroutine that ranges over a chan; the only thing closing that chan
is the backend's own *Stream method. With the default Base stub
returning 'unimplemented' and never touching the chan, the server
goroutine hangs forever and the client hits DeadlineExceeded - which
is exactly what the e2e harness saw in the test-extra-backend-vibevoice-cpp-tts
matrix run.

TTSStream synthesizes via vv_capi_tts to a tempfile, then emits a
streaming WAV header (chunk sizes 0xFFFFFFFF so HTTP clients can
start playback before the full PCM lands) followed by the PCM body
in 64 KB slices. The header + >=2 PCM frames satisfy the harness's
'expected >=2 chunks' assertion and give a real progressive stream.

AudioTranscriptionStream runs the offline transcription, emits each
segment as a delta, and closes with a final_result whose Text equals
the concatenated deltas (the harness asserts those match).

Two new Ginkgo specs guard the close-channel-on-error path so the
deadline-exceeded regression can't come back silently.

* fix(vibevoice-cpp): silence errcheck on cleanup paths

Lint flagged six unchecked Close()/Remove()/RemoveAll() calls along
purely-cleanup deferred paths. Wrap each in '_ = ...' (or a closure
for defers that take args) - matches what the rest of the LocalAI
backend/go/* tree already does for these callsites.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(vibevoice-cpp): closed-loop slot fill + modelRoot-relative path resolution

Two bugs the test-extra-backend-vibevoice-cpp-* CI matrix surfaced:

1. Closed-loop Load with ModelFile=tts.gguf + Options[asr_model=...] left
   v.ttsModel empty, because the default-fill block only ran when BOTH
   slots were empty. vv_capi_load then got tts="" + a voice and the
   C side rejected it with rc=-3 'TTS model required to load a voice'.
   Fix: ModelFile fills the *primary* role-slot (decided by 'type=' in
   Options, defaulting to tts) independently of the secondary, so
   ModelFile + asr_model resolves to both.

2. resolvePath stat'd CWD before falling back to relTo. With LocalAI
   launched from a directory that happens to contain a same-named
   file, supplementary Options[] paths could leak away from the
   models dir. Drop the CWD probe entirely - relative paths now
   *always* join onto opts.ModelPath (the gallery convention).

New Ginkgo coverage:
  * 'ModelFile slot resolution' (4 specs) - asr_model+ModelFile, type=asr,
    explicit tts_model override, key:value variant.
  * 'resolvePath (relative-to-modelRoot)' (5 specs) - join, abs passthrough,
    empty input, empty relTo, and the CWD-trap regression test.
  * 'Load resolves relative Options paths against opts.ModelPath' - end-
    to-end gallery layout round-trip.

Verified locally: 19/19 specs pass (with model bundle, including the
closed-loop TTS->ASR; without bundle, 17 pass + 2 model-dependent skip).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(vibevoice-cpp): use gallery convention in closed-loop spec

The 'loads the realtime TTS model' / closed-loop specs were passing
already-prefixed paths into Options[]:

    Options: ['tokenizer=' + filepath.Join(modelDir, 'tokenizer.gguf')]

Combined with no ModelPath set on the request, the backend's
modelRoot fell back to filepath.Dir(ModelFile) = modelDir, then
resolvePath joined the prefixed Options path on top of it -
producing 'vibevoice-models/vibevoice-models/tokenizer.gguf' when
the CI's VIBEVOICE_MODEL_DIR is the relative './vibevoice-models'.

The fix is to mirror the gallery contract LocalAI core actually
sends in production: ModelPath is the models root (absolute),
ModelFile is a name *under* it, every Options[] path is relative
to ModelPath. Uses filepath.Base() to get bare filenames.

Verified locally with both VIBEVOICE_MODEL_DIR=/tmp/vv-bundle (abs)
and VIBEVOICE_MODEL_DIR=vibevoice-models (the relative shape that
broke CI). Both: 19/19 specs pass, ~55-60s.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(vibevoice-cpp): switch ASR to Q4_K + bump transcription timeout

The Q8_0 ASR gguf is ~14 GB - too big to fit alongside the runner
image, the docker build cache, and the test artifacts on a free
ubuntu-latest GHA runner; 'test-extra-backend-vibevoice-cpp-transcription'
was getting SIGTERM'd at 90 min before the model could finish loading.

Switch to Q4_K (~10 GB on disk, slightly faster CPU decode) for:
  * the e2e harness Make target
  * the gallery 'vibevoice-cpp-asr' entry (parameters + files block)
  * the per-backend test.sh auto-download list

Bump tests-vibevoice-cpp-grpc-transcription's timeout-minutes from
90 to 150 - even with Q4_K, the 30 s JFK clip on a CPU runner needs
runway above the previous 90 min cap.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(vibevoice-cpp): drop transcription gRPC e2e job - too heavy for free runners

The vibevoice ASR is a 7B-parameter model. Even on Q4_K (~10 GB on
disk) a single 30 s transcription saturates the per-test 30 min
timeout in the e2e-backends harness on a 4-core ubuntu-latest, and
the 10 GB download + Docker layer + working space leaves no headroom
on the runner's free disk. Two attempts in CI got SIGTERM'd at the
LoadModel boundary - the bottleneck isn't tunable from the workflow
side without a paid-tier runner.

The per-backend tests-vibevoice-cpp job already runs the same
AudioTranscription path via a closed-loop TTS->ASR Ginkgo spec - same
gRPC contract, same model, single process - so the standalone
tests-vibevoice-cpp-grpc-transcription job was redundant on top of
the disk/CPU pressure.

The Makefile target test-extra-backend-vibevoice-cpp-transcription
stays for local invocation on workstations that can afford it -
useful when developing the streaming codepaths.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(vibevoice-cpp): restore transcription gRPC e2e on bigger-runner

Switch tests-vibevoice-cpp-grpc-transcription from ubuntu-latest to
the self-hosted 'bigger-runner' label that GPU image builds in
backend.yml use, plus the documented Free-disk-space prep step (purge
dotnet / ghc / android / CodeQL caches) the disabled vllm/sglang
entries in this file describe. That gives the 7B-param Q4_K ASR
model the disk + CPU runway it needs.

Keep timeout-minutes: 150 - even on a beefier runner the 30 s JFK
decode plus 10 GB download has to fit comfortably.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(vibevoice-cpp): apt-get install make on bigger-runner before transcription e2e

bigger-runner is a self-hosted bare runner without the standard
ubuntu image's preinstalled build tools, so the previous job died at
the very first command with 'make: command not found' (exit 127).
Add the Dependencies step that the disabled vllm/sglang entries in
this file already document - apt-get installs make + build-essential
+ curl + unzip + ca-certificates + git + tar before the make target
runs. Mirrors how every other 'runs-on: bigger-runner' entry in
backend.yml prepares the runner.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-29 22:22:14 +02:00
LocalAI [bot]
13fe37df89 chore(model gallery): 🤖 add 1 new models via gallery agent (#9611)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-29 09:06:22 +02:00
Richard Palethorpe
4916f8c880 feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map (#9563)
* feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map

LocalAI's vLLM backend wraps a small typed subset of vLLM's
AsyncEngineArgs (quantization, tensor_parallel_size, dtype, etc.).
Anything outside that subset -- pipeline/data/expert parallelism,
speculative_config, kv_transfer_config, all2all_backend, prefix
caching, chunked prefill, etc. -- requires a new protobuf field, a
Go struct field, an options.go line, and a backend.py mapping per
feature. That cadence is the bottleneck on shipping vLLM's
production feature set.

Add a generic `engine_args:` map on the model YAML that is
JSON-serialised into a new ModelOptions.EngineArgs proto field and
applied verbatim to AsyncEngineArgs at LoadModel time. Validation
is done by the Python backend via dataclasses.fields(); unknown
keys fail with the closest valid name as a hint.
dataclasses.replace() is used so vLLM's __post_init__ re-runs and
auto-converts dict values into nested config dataclasses
(CompilationConfig, AttentionConfig, ...). speculative_config and
kv_transfer_config flow through as dicts; vLLM converts them at
engine init.

Operators can now write:

  engine_args:
    data_parallel_size: 8
    enable_expert_parallel: true
    all2all_backend: deepep_low_latency
    speculative_config:
      method: deepseek_mtp
      num_speculative_tokens: 3
    kv_cache_dtype: fp8

without further proto/Go/Python plumbing per field.

Production defaults seeded by hooks_vllm.go: enable_prefix_caching
and enable_chunked_prefill default to true unless explicitly set.

Existing typed YAML fields (gpu_memory_utilization,
tensor_parallel_size, etc.) remain for back-compat; engine_args
overrides them when both are set.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* chore(vllm): pin cublas13 to vLLM 0.20.0 cu130 wheel

vLLM's PyPI wheel is built against CUDA 12 (libcudart.so.12) and won't
load on a cu130 host. Switch the cublas13 build to vLLM's per-tag cu130
simple-index (https://wheels.vllm.ai/0.20.0/cu130/) and pin
vllm==0.20.0. The cu130-flavoured wheel ships libcudart.so.13 and
includes the DFlash speculative-decoding method that landed in 0.20.0.

cublas13 install gets --index-strategy=unsafe-best-match so uv consults
both the cu130 index and PyPI when resolving — PyPI also publishes
vllm==0.20.0, but with cu12 binaries that error at import time.

Verified: Qwen3.5-4B + z-lab/Qwen3.5-4B-DFlash loads and serves chat
completions on RTX 5070 Ti (sm_120, cu130).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* ci(vllm): bot job to bump cublas13 vLLM wheel pin

vLLM's cu130 wheel index URL is itself version-locked
(wheels.vllm.ai/<TAG>/cu130/, no /latest/ alias upstream), so a vLLM
bump means rewriting two values atomically — the URL segment and the
version constraint. bump_deps.sh handles git-sha-in-Makefile only;
add a sibling bump_vllm_wheel.sh and a matching workflow job that
mirrors the existing matrix's PR-creation pattern.

The bumper queries /releases/latest (which excludes prereleases),
strips the leading 'v', and seds both lines unconditionally. When the
file is already on the latest tag the rewrite is a no-op and
peter-evans/create-pull-request opens no PR.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* docs(vllm): document engine_args and speculative decoding

The new engine_args: map plumbs arbitrary AsyncEngineArgs through to
vLLM, but the public docs only covered the basic typed fields. Add a
short subsection in the vLLM section explaining the typed/generic
split and showing a worked DFlash speculative-decoding config, with
pointers to vLLM's SpeculativeConfig reference and z-lab's drafter
collection.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-29 00:49:28 +02:00
LocalAI [bot]
55afda22e3 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 453a027c17e4d63a7f16b871197a396240a65138 (#9608)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-29 00:18:19 +02:00
LocalAI [bot]
1fe3558ec6 feat(swagger): update swagger (#9607)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-29 00:18:02 +02:00
Ettore Di Giacinto
e370318bd7 fix(vllm): seed pybind11 for fastsafetensors build under --no-build-isolation
fastsafetensors==0.3 (transitive dep of vllm) imports pybind11 in
setup.py without declaring it in build-system.requires. With
--no-build-isolation it has to already exist in the venv, otherwise the
wheel build fails with ModuleNotFoundError on arm64 L4T CUDA 13 (and
any other profile that picks up vllm 0.20.0).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-28 20:08:26 +00:00
Richard Palethorpe
4443250756 chore: add golangci-lint with new-from-merge-base baseline (#9603)
* chore: add golangci-lint with new-from-merge-base baseline

Configure golangci-lint v2 with the standard linter set (errcheck, govet,
ineffassign, unused) plus forbidigo, which enforces the Ginkgo/Gomega-only
test convention from .agents/coding-style.md by rejecting stdlib testing
calls (t.Errorf, t.Fatalf, t.Run, ...). staticcheck is disabled — the
codebase has many pre-existing QF-style suggestions not worth gating on.

issues.new-from-merge-base = master makes the lint job a gate for new
issues only; the ~1300 pre-existing baseline stays visible via
'make lint-all' for incremental cleanup. CI runs 'make lint'.

Backends needing C/C++ headers we don't install in the lint runner are
excluded via a deny list in the Makefile (backend/go/{piper,silero-vad,
llm}, cmd/launcher). Discovery still flows through 'go list ./...', so
new packages are scanned automatically.

To make backend/go/{sam3-cpp,stablediffusion-ggml,whisper} typecheckable,
move their .cpp/.h sources into cpp/ subdirs (matching qwen3-tts-cpp /
acestep-cpp). Without this 'go list' rejects the package because Go does
not allow .cpp alongside .go without cgo.

Fix two real bugs found by lint in tests/integration/ (run only via
'make test-stores', not default CI): a stale zerolog reference left over
from the slog migration (c37785b7) and an unused 'os' import.

Assisted-by: Claude Code:Opus 4.7 (1M) [Bash] [Read] [Edit] [Write]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* ci(lint): generate proto sources and fetch full history

The lint job was failing for two reasons:

- pkg/grpc/proto/*.go is generated, not checked in. Several packages
  import it, so without 'make protogen-go' typecheck fails project-wide
  with "no required module provides package github.com/mudler/LocalAI/
  pkg/grpc/proto".

- golangci-lint's new-from-merge-base needs to git-merge-base the PR
  against master, but actions/checkout's default shallow clone doesn't
  fetch master. fetch-depth: 0 brings full history; the config now
  references origin/master (the remote-tracking branch that survives
  the shallow checkout) instead of bare master (which doesn't exist
  locally after checkout).

Assisted-by: Claude Code:Opus 4.7 (1M) [Bash] [Read] [Edit] [Write]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* ci(lint): stub react-ui/dist for go:embed glob

core/http/app.go has //go:embed react-ui/dist/*. The glob must match at
least one non-hidden entry or typecheck fails the whole core/http
package. We don't need the real React bundle to lint Go code, so just
touch an empty index.html to satisfy the embed.

Assisted-by: Claude Code:Opus 4.7 (1M) [Bash] [Read] [Edit] [Write]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-28 22:07:44 +02:00
Ettore Di Giacinto
bcef72b9c1 feat: localai assistant chat modality (#9602)
* fix(tests): inline model_test fixtures after tests/models_fixtures removal

The previous reorg removed tests/models_fixtures/ but core/config/model_test.go
still read CONFIG_FILE/MODELS_PATH env vars pointing into that directory, so
`make test` failed with "open : no such file or directory" on the readConfigFile
spec (the suite ran with --fail-fast and bailed before openresponses_test).

Inline the YAMLs (config/embeddings/grpc/rwkv/whisper) directly into the test
file, materialise them into a per-test tmpdir via BeforeEach, and drop the
env-var lookups. The test no longer depends on Makefile plumbing.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7 [Edit] [Write] [Bash]

* refactor(modeladmin): extract model-admin helpers into a service package

Lift the bodies of EditModelEndpoint, PatchConfigEndpoint,
ToggleStateModelEndpoint, TogglePinnedModelEndpoint and
VRAMEstimateEndpoint into core/services/modeladmin so the same logic can
be called by non-HTTP clients (notably the in-process MCP server that
backs the LocalAI Assistant chat modality, landing in a follow-up commit).

The HTTP handlers shrink to thin shells that parse echo inputs, call the
matching helper, map typed errors (ErrNotFound, ErrConflict,
ErrPathNotTrusted, ErrBadAction, ...) to the existing HTTP status codes,
and render the existing response shapes. No REST-surface behaviour change;
the existing localai endpoint tests cover the regression net.

Adds focused unit tests for each helper against tmp-dir-backed
ModelConfigLoader fixtures (deep-merge patch, rename + conflict, path
separator guard, toggle/pin enable/disable, sync callback).

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(assistant): LocalAI Assistant chat modality with in-memory MCP server

Adds a chat modality, admin-only, that wires the chat session to an
in-memory MCP server exposing LocalAI's own admin/management surface as
tools. An admin can install models, manage backends, edit configs and
check status by chatting; the LLM calls tools like gallery_search,
install_model, import_model_uri, list_installed_models, edit_model_config
and surfaces the results.

Same Go package powers two modes:

  pkg/mcp/localaitools/

    NewServer(client, opts) builds an MCP server that registers the
    19-tool admin catalog. The LocalAIClient interface has two impls:

    - inproc.Client — calls services directly (no HTTP loopback,
      no synthetic admin API key). Used in-process by the chat handler.
    - httpapi.Client — calls the LocalAI REST API. Used by the new
      `local-ai mcp-server --target=…` subcommand to control a remote
      LocalAI from a stdio MCP host.

    Tools and their embedded skill prompts are agnostic to which client
    backs them. Skill prompts are markdown files under prompts/, embedded
    via go:embed and assembled into the system prompt at server init.

Wiring:

  - core/http/endpoints/mcp/localai_assistant.go — process-wide holder
    that spins up the in-memory MCP server once at Application start
    using paired net.Pipe transports, then reuses LocalToolExecutor
    (no fork) for every chat request that opts in.

  - core/http/endpoints/openai/chat.go — small branch ahead of the
    existing MCP block: when metadata.localai_assistant=true,
    defense-in-depth admin check + executor swap + system-prompt
    injection. All downstream tool dispatch is unchanged.

  - core/http/auth/{permissions,features}.go — adds
    FeatureLocalAIAssistant; gating happens at the chat handler entry
    plus admin-only `/api/settings`.

  - core/cli/{run.go,cli.go,mcp_server.go} —
    LOCALAI_DISABLE_ASSISTANT flag (runtime-toggleable via Settings, no
    restart), plus `local-ai mcp-server` stdio subcommand.

  - core/config/runtime_settings.go — `localai_assistant_enabled`
    runtime setting; the chat handler reads `DisableLocalAIAssistant`
    live at request entry.

UI:

  - Home.jsx — prominent self-explanatory CTA card on first run
    ("Manage LocalAI by chatting"); collapses to a compact
    "Manage by chat" button in the quick-links row once used,
    persisted via localStorage.
  - Chat.jsx — admin-only "Manage" toggle in the chat header,
    "Manage mode" badge, dedicated empty-state copy, starter chips.
  - Settings.jsx — "LocalAI Assistant" section with the runtime
    enable toggle.
  - useChat.js — `localaiAssistant` flag on the chat schema; injects
    `metadata.localai_assistant=true` on requests when active.

Distributed mode: the in-memory MCP server lives only on the head node;
inproc.Client wraps already-distributed-aware services so installs
propagate to workers via the existing GalleryService machinery.

Documentation: `.agents/localai-assistant-mcp.md` is the contributor
contract — when adding an admin REST endpoint, also add a LocalAIClient
method, an inproc + httpapi impl, a tool registration, and a skill
prompt update; the AGENTS.md index links to it.

Out of scope (follow-ups): per-tool RBAC granularity for non-admin
read-only access; streaming mcp_tool_progress for long installs;
React Vitest rig for the UI changes.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(assistant): extract tool/capability/MiB/server-name constants

The MCP tool surface, capability tag set, server-name default, and the
chat-handler metadata key were repeated as bare string literals across
seven files. Renaming any one required hand-editing every call site and
risked code/test/prompt drift.

This pulls them into typed constants:

- pkg/mcp/localaitools/tools.go — Tool* constants for the 19 MCP tools,
  plus DefaultServerName.
- pkg/mcp/localaitools/capability.go — typed Capability + constants for
  the capability tag set the LLM passes to list_installed_models. The
  type rides through LocalAIClient.ListInstalledModels and replaces the
  triplet of "embed"/"embedding"/"embeddings" with the single
  CapabilityEmbeddings.
- pkg/mcp/localaitools/inproc/client.go — bytesPerMiB constant for the
  VRAMEstimate byte→MB conversion.
- core/http/endpoints/mcp/tools.go — MetadataKeyLocalAIAssistant for the
  "localai_assistant" request-metadata key consumed by the chat handler.

Tool registrations, the test catalog, the dispatch table, the validation
fixtures, and the fake/stub clients all reference the constants. The
embedded skill prompts under prompts/ keep their bare strings (go:embed
markdown can't import Go constants); the existing TestPromptsContain
SafetyAnchors guards the alignment.

No behaviour change. All tests pass with -race.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(modeladmin): typed Action for ToggleState/TogglePinned

The toggle/pin verbs were bare strings everywhere — handler signatures,
service implementations, MCP tool args, the fake/stub clients, the
inproc and httpapi LocalAIClient impls, plus 4 test files. A typo in
any caller silently fell through to the runtime "must be 'enable' or
'disable'" check.

Introduce core/services/modeladmin.Action (string alias) with
ActionEnable, ActionDisable, ActionPin, ActionUnpin and a small Valid
helper. The compiler now catches mismatches at every boundary; renames
ripple through one source of truth.

LocalAIClient.ToggleModelState/Pinned signatures change to take
modeladmin.Action. The package is brand-new and unreleased so this is
a free public-API tightening.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(assistant): respect ctx cancellation on gallery channel sends

InstallModel, DeleteModel, ImportModelURI, InstallBackend and
UpgradeBackend all pushed onto galleryop channels with bare sends. If the
worker was paused or the buffer full, the chat-handler goroutine blocked
forever — the LLM kept polling and the request leaked.

Wrap the five sends in a sendModelOp/sendBackendOp helper that selects
on ctx.Done() so a cancelled chat completion surfaces context.Canceled
back to the LLM instead of hanging.

Adds inproc/client_test.go with a pre-cancelled-ctx regression test on
InstallModel; the helpers are shared so the same guarantee covers the
other four call sites.

Assisted-by: Claude:claude-opus-4-7 [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(assistant): graceful shutdown for in-memory holder and stdio CLI

Two related leaks:

- Application.start() built the LocalAIAssistantHolder but never wired
  Close() into the graceful-termination chain — the in-memory MCP
  transport pair stayed alive until process exit, and the goroutines
  behind net.Pipe() didn't drain. Hook into the existing
  signals.RegisterGracefulTerminationHandler chain (same pattern as
  core/http/endpoints/mcp/tools.go:770).

- core/cli/mcp_server.go ran srv.Run with context.Background(); a
  Ctrl-C from the host (Claude Desktop, mcphost, npx inspector) or a
  SIGTERM from process supervision left the stdio loop reading from a
  closed pipe. Switch to signal.NotifyContext to surface the signal
  through ctx and let srv.Run drain.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(assistant): typed HTTPError + propagate prompt walk error

The httpapi client detected "no such job" by substring-matching on the
error string ("404", "could not find") — brittle to status-code
formatting changes and to LocalAI fixing /models/jobs/:uuid to return a
proper 404. Replace with a typed *HTTPError whose Is() method honours
errors.Is(err, ErrHTTPNotFound). The 500-with-"could not find" branch
stays as a transitional fallback documented in Is().

Same change covers ListNodes' 404 fallback for the /api/nodes endpoint.

Adds httptest tests for both 404 and the legacy 500 path, plus a
direct errors.Is exposure test so external callers (the standalone
stdio CLI host) can match without re-string-parsing.

Also tightens prompts.SystemPrompt: panic when fs.WalkDir on the
embedded FS fails. The only realistic cause is a build-time //go:embed
misconfiguration; serving an empty system prompt to the LLM is much
worse than crashing init. TestSystemPromptIncludesAllEmbeddedFiles
catches regressions in CI.

Assisted-by: Claude:claude-opus-4-7 [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(modeladmin): atomic writes for model config files

The five sites that wrote model YAML used os.WriteFile, which opens
with O_TRUNC|O_WRONLY|O_CREATE. A crash mid-write left the destination
truncated and the model unloadable until manual repair. Pre-existing
behaviour inherited from the original endpoint handlers — fix once now
that there's a single helper.

Adds writeFileAtomic: writes to a sibling temp file, chmods, syncs via
Close(), then os.Rename. Same-directory temp keeps the rename atomic on
the same filesystem; cleanup runs on every error path so stray temps
don't accumulate. No new dependency.

Applied to:
- ConfigService.PatchConfig
- ConfigService.EditYAML (both rename and in-place branches)
- mutateYAMLBoolFlag (drives ToggleState + TogglePinned)

atomic_test.go covers the happy path plus a read-only-dir failure case
that asserts the original file is preserved (skipped on Windows where
the chmod trick is POSIX-specific).

Assisted-by: Claude:claude-opus-4-7 [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(assistant): prune dead code, mark stub, document conventions

Three small cleanups landing together:

- Drop the unused errNotImplemented sentinel from inproc/client.go.
  All five methods that used to return it are wired to modeladmin
  helpers since the Phase B commit; the package var is dead.

- Annotate httpapi.Client.GetModelConfig as a known stub. LocalAI's
  /models/edit/:name returns rendered HTML, not JSON, so the standalone
  CLI's get_model_config tool surfaces a clear error to the LLM. A
  future JSON-only /api/models/config-yaml/:name endpoint is tracked in
  the agent contract; FIXME points at it.

- Extend `.agents/localai-assistant-mcp.md` with a "Code conventions"
  section that documents the audit-driven rules: tool/Capability/Action
  constants, errors.Is over substring matching, ctx-aware channel
  sends, atomic writes, and graceful shutdown. Refresh the file map so
  it lists tools.go and capability.go and drops the removed
  tools_bootstrap.go.

The tools_models.go diff is a comment-only change explaining why the
ModelName empty-string check stays at the tool layer (consistency
across LocalAIClient implementations, since the SDK schema validator
only enforces presence, not non-empty).

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(assistant): convert test files to ginkgo + gomega

The repo convention (per core/http/endpoints/localai/*_test.go,
core/gallery/**, etc.) is Ginkgo v2 with Gomega assertions. The tests I
introduced for the assistant feature used vanilla testing.T, which made
them stand out and stripped the BDD structure the rest of the suite
relies on.

Convert every test file in the assistant scope to Ginkgo:

  pkg/mcp/localaitools/
    dto_test.go            — Describe("DTOs round-trip through JSON")
    prompts_test.go        — Describe("SystemPrompt assembler")
    server_test.go         — Describe("Server tool catalog"),
                              Describe("Tool dispatch"),
                              Describe("Tool error surfacing"),
                              Describe("Argument validation"),
                              Describe("Concurrent tool calls")
    parity_test.go         — Describe("LocalAIClient parity"),
                              hosts the suite's single RunSpecs (the file
                              is package localaitools_test so it can
                              import httpapi without an import cycle;
                              Ginkgo aggregates Describes from both the
                              internal and external test packages into
                              one run).
    httpapi/client_test.go — Describe("httpapi.Client against the
                              LocalAI admin REST surface"),
                              Describe("ErrHTTPNotFound"),
                              Describe("Bearer token")
    inproc/client_test.go  — Describe("inproc.Client cancellation")

  core/services/modeladmin/
    config_test.go         — Describe("ConfigService") with sub-Describes
                              for GetConfig, PatchConfig, EditYAML
    state_test.go          — Describe("ConfigService.ToggleState")
    pinned_test.go         — Describe("ConfigService.TogglePinned")
    atomic_test.go         — Describe("writeFileAtomic")

  core/http/endpoints/mcp/
    localai_assistant_test.go — Describe("LocalAIAssistantHolder")

Each package gets a `*_suite_test.go` with the standard
`RegisterFailHandler(Fail) + RunSpecs(t, "...")` boilerplate. Helpers
that previously took *testing.T (newTestService, writeModelYAML,
readMap, sortedStrings, sortGalleries, etc.) drop the *T receiver and
use Gomega Expectations directly. tmp dirs come from GinkgoT().TempDir().

No semantic change to test coverage — every original assertion has a
direct Gomega counterpart. All suites pass with -race.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test+docs(assistant): drift detector for Tool ↔ REST route mapping

Honest gap from the audit: the parity_test.go suite only checks four
methods, and uses the same httpapi.Client for both sides — it asserts
stability of the DTO shapes, not equivalence between in-process and
HTTP. If a contributor adds an admin REST endpoint without an MCP tool,
or a tool without a matching httpapi route, both surfaces silently
diverge.

Add a coverage test plus stronger docs:

- pkg/mcp/localaitools/coverage_test.go introduces a hand-maintained
  toolToHTTPRoute map: every Tool* constant must list the REST endpoint
  the httpapi.Client hits (or "(none)" with a documented reason). Two
  Ginkgo specs assert the map and the published catalog stay in sync —
  one fails when a Tool is added without a route entry, the other fails
  when a route entry references a tool that no longer exists. Verified
  by removing the ToolDeleteModel entry locally; the test fired with a
  clear message pointing the contributor at the file.

  Deliberate non-test: we don't enumerate live admin REST routes from
  here. Walking the route registry requires booting Application;
  parsing core/http/routes/localai.go is brittle. The "new admin REST
  endpoint → MCP tool" direction stays a PR checklist item — see below.

- AGENTS.md gets a new Quick Reference bullet that calls out the rule
  and points at the test by name.

- .agents/api-endpoints-and-auth.md tightens the existing "Companion:
  MCP admin tool surface" subsection from "if useful, consider..." to
  "MUST be considered, with three concrete outcomes (tool added,
  deliberately skipped with documented reason, or forgot — which
  breaks the contract)". Adds a checklist item at the bottom of the
  file's authoritative checklist.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(assistant): drop duplicate DTOs, surface canonical types

Audit feedback: localaitools/dto.go reinvented several types that already
existed in the codebase. Replace the duplicates with the canonical types
so the LLM-visible wire format stays aligned with the rest of LocalAI by
construction (no parallel structs to keep in sync).

Removed (and the canonical type now used by the LocalAIClient interface):

  localaitools.Gallery          → config.Gallery
  localaitools.GalleryModelHit  → gallery.Metadata
  localaitools.VRAMEstimate     → vram.EstimateResult

Tightened scope:

  localaitools.Backend          → kept, but reduced to {Name, Installed}.
                                  ListKnownBackends now returns
                                  []schema.KnownBackend (the canonical
                                  type already used by REST /backends/known).

Kept with documented rationale:

  localaitools.JobStatus       — galleryop.OpStatus has Error error which
                                 marshals to "{}". JobStatus is the
                                 JSON-friendly mirror.
  localaitools.Node            — nodes.BackendNode carries gorm internals
                                 + token hash; we expose only the
                                 LLM-relevant fields.
  ImportModelURIRequest/Response — schema.ImportModelRequest and
                                   GalleryResponse are wire-shaped, mine
                                   are LLM-shaped (BackendPreference flat,
                                   AmbiguousBackend exposed).

Side wins:

  - Drop bytesPerMiB; vram.EstimateResult already carries human-readable
    display strings (size_display, vram_display) the LLM uses directly.
  - Drop the handler-private vramEstimateRequest in
    core/http/endpoints/localai/vram.go and bind directly into
    modeladmin.VRAMRequest (now JSON-tagged).

Both clients pass through these types now where possible (e.g.
ListGalleries in inproc.Client is a one-liner returning
AppConfig.Galleries; httpapi.Client.GallerySearch decodes straight into
[]gallery.Metadata).

All tests green with -race.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(assistant): extract REST route paths into named constants

httpapi.Client had 18 bare-string path sites scattered across methods.
Pull them into pkg/mcp/localaitools/httpapi/routes.go: static paths as
package-private constants, dynamic paths as small builders that handle
url.PathEscape on segment values.

No behaviour change. Drops the now-unused net/url import from client.go
since path escaping moved into routes.go alongside the path it applies to.

Local-only by design: the server-side registrations in
core/http/routes/localai.go remain bare strings. Sharing constants across
the pkg/ ↔ core/ boundary would invert the layering today; the existing
Tool↔REST drift-detector in coverage_test.go is the safety net for that
direction.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

* docs(assistant): align with shipped UI and dropped bootstrap env vars

The LocalAI Assistant doc still described the older iteration:

- The in-chat toggle was renamed from "Admin" to "Manage" (the badge is
  now "Manage mode" and the home page exposes a "Manage by chat" CTA).
- LOCALAI_ASSISTANT_BOOTSTRAP_MODEL / --localai-assistant-bootstrap-model
  and the bootstrap_default_model tool were removed — admins pick a model
  from the existing selector instead, no env-var configuration required.
- The shipped tool catalog includes import_model_uri but didn't appear in
  the doc; bootstrap_default_model appeared but no longer exists.
- The Settings → LocalAI Assistant runtime toggle wasn't mentioned as the
  preferred way to disable without restart.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-28 19:29:27 +02:00
Ettore Di Giacinto
142919fc79 fix(tests): inline model_test fixtures after tests/models_fixtures removal
The previous reorg removed tests/models_fixtures/ but core/config/model_test.go
still read CONFIG_FILE/MODELS_PATH env vars pointing into that directory, so
`make test` failed with "open : no such file or directory" on the readConfigFile
spec (the suite ran with --fail-fast and bailed before openresponses_test).

Inline the YAMLs (config/embeddings/grpc/rwkv/whisper) directly into the test
file, materialise them into a per-test tmpdir via BeforeEach, and drop the
env-var lookups. The test no longer depends on Makefile plumbing.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7 [Edit] [Write] [Bash]
2026-04-28 12:58:49 +00:00
dependabot[bot]
439471baec chore(deps): bump packaging from 24.1 to 26.2 in /backend/python/coqui (#9594)
Bumps [packaging](https://github.com/pypa/packaging) from 24.1 to 26.2.
- [Release notes](https://github.com/pypa/packaging/releases)
- [Changelog](https://github.com/pypa/packaging/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pypa/packaging/compare/24.1...26.2)

---
updated-dependencies:
- dependency-name: packaging
  dependency-version: '26.2'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-28 08:44:53 +02:00
dependabot[bot]
eff4be6794 chore(deps): bump github.com/onsi/ginkgo/v2 from 2.28.1 to 2.28.2 (#9593)
Bumps [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) from 2.28.1 to 2.28.2.
- [Release notes](https://github.com/onsi/ginkgo/releases)
- [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md)
- [Commits](https://github.com/onsi/ginkgo/compare/v2.28.1...v2.28.2)

---
updated-dependencies:
- dependency-name: github.com/onsi/ginkgo/v2
  dependency-version: 2.28.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-28 08:44:41 +02:00
dependabot[bot]
f1ec30d646 chore(deps): bump github.com/testcontainers/testcontainers-go/modules/postgres from 0.41.0 to 0.42.0 (#9591)
chore(deps): bump github.com/testcontainers/testcontainers-go/modules/postgres

Bumps [github.com/testcontainers/testcontainers-go/modules/postgres](https://github.com/testcontainers/testcontainers-go) from 0.41.0 to 0.42.0.
- [Release notes](https://github.com/testcontainers/testcontainers-go/releases)
- [Commits](https://github.com/testcontainers/testcontainers-go/compare/v0.41.0...v0.42.0)

---
updated-dependencies:
- dependency-name: github.com/testcontainers/testcontainers-go/modules/postgres
  dependency-version: 0.42.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-28 08:44:27 +02:00
LocalAI [bot]
f3500223d7 chore: ⬆️ Update leejet/stable-diffusion.cpp to a81677f59c92d90343aebca51dfed7decf0a0cb0 (#9586)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-28 08:44:10 +02:00
LocalAI [bot]
b69bacfcdc chore: ⬆️ Update ikawrakow/ik_llama.cpp to d6f3e4e28fbf75e6181e6ea32e734de9ce9304fd (#9585)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-28 08:43:51 +02:00
LocalAI [bot]
8e50066fa2 chore: ⬆️ Update ggml-org/llama.cpp to 665abc609740d397d30c0d8ef4157dbf900bd1a3 (#9584)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-28 08:43:33 +02:00
Ettore Di Giacinto
a0317d9926 refactor(tests): split app_test.go, move real-backend coverage to e2e-backends
core/http/app_test.go had grown to 1495 lines exercising three concerns at
once: HTTP-layer integration, real-backend inference (llama-gguf, tts,
stablediffusion, transformers embeddings, whisper), and service logic that
already has unit-level coverage. Each PR paid for 6 backend builds plus
real-model downloads to satisfy a single suite.

Reorg per layer:

- app_test.go (1495 -> 1003 lines) drives the mock-backend binary only.
  Kept: auth, routing, gallery API, file:// import, /system, agent-jobs
  HTTP plumbing, config-file model loading. Deleted real-inference specs
  (llama-gguf chat, ggml completions/streaming, logprobs, logit_bias,
  transcription, embeddings, External-gRPC, Stores duplicate, Model gallery
  Context). Lifted Agent Jobs out of the deleted Stores Context.
- tests/e2e-backends/backend_test.go gains logprobs, logit_bias, and
  no-first-token-dup specs (the latter folded into PredictStream). Two
  new caps gate them so non-LLM backends opt out.
- tests/e2e-aio/e2e_test.go gains a streaming smoke under Context("text")
  to catch container-level streaming regressions.
- tests/models_fixtures/ removed; all fixtures referenced testmodel.ggml.
  app_test.go now writes per-Context inline mock-model YAMLs.

CI:

- test.yml + tests-e2e.yml gain paths-ignore (docs/, examples/, *.md,
  backend/) so docs and backend-only PRs skip them. test.yml drops the
  6-backend Build step plus TRANSFORMER_BACKEND/GO_TAGS=tts; tests-apple
  drops the llama-cpp-darwin build.
- New tests-aio.yml runs the AIO container nightly + on workflow_dispatch
  + master/tags. The tests-e2e-container job moved out of test.yml so PRs
  no longer pay AIO cost.
- New tests-llama-cpp-smoke job in test-extra.yml runs on every PR with
  no detect-changes gate; pulls quay.io/go-skynet/local-ai-backends:
  master-cpu-llama-cpp (no build on PR) and exercises predict/stream/
  logprobs/logit_bias against Qwen3-0.6B. This is the PR-acceptance
  real-backend gate after AIO moved to nightly. The path-gated heavy
  test-extra-backend-llama-cpp wrapper appends the same caps so it
  exercises the moved specs when the backend actually changes.

Makefile:

- Deleted test-models/testmodel.ggml (the wget chain), test-llama-gguf,
  test-tts, test-stablediffusion, test-realtime-models. test target
  drops --label-filter, HUGGINGFACE_GRPC, TRANSFORMER_BACKEND, TEST_DIR,
  FIXTURES, CONFIG_FILE, MODELS_PATH, BACKENDS_PATH; depends on
  build-mock-backend. test-stores keeps a focused entry point and depends
  on backends/local-store. clean-tests also clears the mock-backend
  binary.

Net per typical Go-side PR: ~25min (6 backend builds + tests + AIO) +
~8min e2e drops to ~5min mock-backend test + ~8min e2e + ~5-10min
llama-cpp-smoke (image pulled). Docs and backend-only PRs skip the
always-on workflows entirely.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7 [Edit] [Write] [Bash]
2026-04-27 23:09:20 +00:00
Ettore Di Giacinto
3948b580d2 fix(distributed): worker stopBackend/isRunning resolve bare modelID to replica keys
PR #9583 changed the supervisor's process map key from `modelID` to
`modelID#replicaIndex`, but the NATS lifecycle handlers kept passing
the bare modelID:

* `backend.stop` (subscribeLifecycleEvents): `s.stopBackend(req.Backend)`
  → `s.processes["Qwen3.6-..."]` missed (actual key is "...#0") →
  silent no-op. Admin "Unload model" clicks released VRAM via
  model.unload but left the gRPC process alive on its old port.
  Subsequent chats hit installBackend, found the leftover process,
  reused its address — and the UI reported "no models loaded" while
  the model kept responding.

* `backend.delete` (subscribeLifecycleEvents): same map miss in
  `isRunning(req.Backend)` and `s.stopBackend(req.Backend)` — admin
  "Delete backend" deleted the binary while the process was still
  serving traffic.

Add `resolveProcessKeys(id)`: exact match if `id` is a full processKey
(stopAllBackends iterates the map and passes its own keys);
prefix-match if `id` is bare (NATS handlers); empty if `id` contains
`#` but doesn't match (no spurious fallback when the caller was
explicit). stopBackend and isRunning now call it; stopBackend gets a
new stopBackendExact helper for per-key cleanup.

TDD: regression test fails without the fix (resolveProcessKeys
doesn't exist; map lookup by bare name returns nothing). Tests pass
post-fix.

Reproduced live: registry row count was 0 for the model the user
"Unloaded", chat still served by the leftover worker process.
SmartRouter behavior is correct in itself — it falls through to
scheduleAndLoad when no row exists; the bug was that the leftover
process corrupted the install path.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [Bash]
2026-04-27 21:43:15 +00:00
LocalAI [bot]
5efbe8405f feat(swagger): update swagger (#9587)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-27 23:28:03 +02:00
Ettore Di Giacinto
ea1df8945b fix(distributed): preserve UI-added node labels across worker re-register
The register endpoint called SetNodeLabels(req.Labels) — replace-all
semantics — so every worker re-register wiped every label not in the
worker's body. The bug existed since labels were introduced in
PR #9186 (Mar 31), but only triggered for workers that supplied
labels via --node-labels.

PR #9583 (the multi-replica refactor) added an auto-mirrored
`node.replica-slots` label to every worker's registration body, which
made `len(req.Labels) > 0` always true — turning a latent edge-case
bug into a universal one. Operators reported "labels assigned to
node do not persist": labels survived until the next worker restart,
then disappeared.

Fix: iterate req.Labels and call SetNodeLabel (upsert) for each
instead of SetNodeLabels (delete-then-recreate). Worker-managed
labels still refresh on re-register; UI-added labels survive.

Trade-off: an operator who removes a label from --node-labels won't
have it auto-removed from the DB on next register — they can clean it
via the UI. Acceptable, since the alternative (current behavior)
silently destroys operator state.

Regression test added first (TDD): RegisterNodeEndpoint registers a
node, the test simulates a UI add via SetNodeLabel, then re-registers
with a different worker label set; assertion that the UI-added label
survives. Test fails against the broken code, passes against the fix.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [Bash]
2026-04-27 21:24:50 +00:00
Ettore Di Giacinto
3280b9a287 fix(distributed): per-replica backend logs (store aggregation + UI)
The multi-replica refactor (PR #9583) changed the worker's process key
from `modelID` to `modelID#replicaIndex`, but the BackendLogStore kept
the bare-modelID lookup. Result: every distributed deployment lost
backend logs in the Nodes UI — single-replica too, since even the
default capacity of 1 produces a `#0` suffix.

Two changes wired together:

* pkg/model: BackendLogStore.GetLines/Subscribe now treat a modelID
  without `#` as a model prefix and merge across all `modelID#N` replica
  buffers (timestamp-sorted for GetLines; fan-in for Subscribe). Calls
  with a full `modelID#N` key resolve exactly. ListModels strips
  replica suffixes and deduplicates so the listing surfaces one entry
  per loaded model.

* react-ui: per-replica log streams as the default. Loaded Models
  table disambiguates each row with a `rep N` pill (only when the node
  hosts >1 replica of a model). Each row's "View logs" link routes to
  the per-replica process key so operators see only that replica's
  output. The logs page renders the replica context as a chip in the
  title and surfaces a segmented control — `Replica 0 / 1 / … / All
  merged` — when the model has multiple replicas; the merged segment
  uses the bare-modelID URL (delegating to the store's prefix
  aggregation) for the side-by-side comparison case. Single-replica
  deployments see no extra UI.

Tests added first (TDD): the regression set in
backend_log_store_test.go reproduces the bug at the exact failure
point — GetLines/ListModels/Subscribe assertions all fail against the
broken code, all pass against the fix. TestSubscribe_PerReplicaFilter
pins the exact-key path so a future change can't silently break it.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [Skill:critique] [Skill:audit] [Skill:polish] [Skill:distill]
2026-04-27 20:55:24 +00:00
Ettore Di Giacinto
375bf1929d fix(ui): hide meta-dev backends in System → Backends Development toggle
The Manage view's flagsFor() short-circuited on b.IsMeta and returned
dev=false for every meta backend, so meta-dev entries
(e.g. llama-cpp-development, whisper-development, insightface-development)
leaked through the Development toggle in distributed mode and stayed
visible whether the toggle was on or off. The count chip even
under-reported because those rows were excluded from it.

Drop the IsMeta short-circuit and trust gallery enrichment for both
flags. Production metas (llama-cpp) are tagged isAlias=false /
isDevelopment=false in the gallery so they still pass both toggles;
meta-dev entries carry isDevelopment=true and now correctly hide
alongside concrete dev variants.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-27 20:38:20 +00:00
Ettore Di Giacinto
9a7f5e68bd ci(darwin): add native caches to backend_build_darwin
macOS runners can't use the registry-backed BuildKit cache (no Docker
daemon), so every darwin matrix run was paying full cost for brew
installs, Go module downloads, llama.cpp recompiles and Python wheel
resolution.

Wires actions/cache@v4 into the reusable workflow for four caches:

- Go modules + build cache (setup-go cache: true), shared across matrix
- Homebrew downloads + selected /opt/homebrew/Cellar entries, with
  HOMEBREW_NO_AUTO_UPDATE so restored Cellar paths stay stable
- ccache for the llama-cpp CMake variants, keyed on the pinned
  LLAMA_VERSION; CMAKE_*_COMPILER_LAUNCHER is exported via GITHUB_ENV
  so backend/cpp/llama-cpp/Makefile picks it up without script changes
- Python uv + pip wheel cache, keyed by backend + ISO week — same
  one-cold-rebuild-per-week cadence as the Linux DEPS_REFRESH

Read/write semantics match the existing BuildKit policy: every run
restores, only master/tag pushes save, so PRs can't pollute master's
warm cache.

Documents the new caches and the macOS-specific constraints in
.agents/ci-caching.md.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]
2026-04-27 20:17:36 +00:00
Ettore Di Giacinto
6b63b47f61 feat(distributed): support multiple replicas of one model on the same node (#9583)
* feat(distributed): support multiple replicas of one model on the same node

The distributed scheduler implicitly assumed `(node_id, model_name)` was
unique, but the schema didn't enforce it and the worker keyed all gRPC
processes by model name alone. With `MinReplicas=2` against a single
worker, the reconciler "scaled up" every 30s but the registry never
advanced past 1 row — the worker re-loaded the model in-place every tick
until VRAM fragmented and the gRPC process died.

This change introduces multi-replica-per-node as a first-class concept,
with capacity-aware scheduling, a circuit breaker, and VRAM
soft-reservation. Operators can declare per-node capacity via the worker
flag `--max-replicas-per-model` (mirrored as auto-label
`node.replica-slots=N`) or override per-node from the UI.

* Schema: BackendNode gains MaxReplicasPerModel (default 1) and
  ReservedVRAM. NodeModel gains ReplicaIndex (composite with node_id +
  model_name). ModelSchedulingConfig gains UnsatisfiableUntil/Ticks for
  the reconciler circuit breaker.

* Registry: replica_index threaded through SetNodeModel, RemoveNodeModel,
  IncrementInFlight, DecrementInFlight, TouchNodeModel, GetNodeModel,
  SetNodeModelLoadInfo and the InFlightTrackingClient. New helpers:
  CountReplicasOnNode, NextFreeReplicaIndex (with ErrNoFreeSlot),
  RemoveAllNodeModelReplicas, FindNodesWithFreeSlot,
  ClusterCapacityForModel, ReserveVRAM/ReleaseVRAM (atomic UPDATE with
  ErrInsufficientVRAM), and the unsatisfiable-flag CRUD.

* Worker: processKey now `<modelID>#<replicaIndex>` so concurrent loads
  of the same model land on distinct ports. Adds CLI flag
  --max-replicas-per-model (env LOCALAI_MAX_REPLICAS_PER_MODEL, default 1)
  and emits the auto-label.

* Router: scheduleNewModel filters candidates by free slot, allocates the
  replica index, and soft-reserves VRAM before installing the backend.
  evictLRUAndFreeNode now deletes the targeted row by ID instead of all
  replicas of the model on the node — fixes a latent bug where evicting
  one replica orphaned its siblings.

* Reconciler: caps scale-up at ClusterCapacityForModel so a misconfig
  (MinReplicas > capacity) doesn't loop forever. After 3 consecutive
  ticks of capacity==0 it sets UnsatisfiableUntil for a 5m cooldown and
  emits a warning. ClearAllUnsatisfiable fires from Register,
  ApproveNode, SetNodeLabel(s), RemoveNodeLabel and
  UpdateMaxReplicasPerModel so a new node joining or label changes wake
  the reconciler immediately. scaleDownIdle removes highest-replica-index
  first to keep slots compact.

* Heartbeat resets reserved_vram to 0 — worker is the source of truth
  for actual free VRAM; the reservation is only for the in-tick race
  window between two scheduling decisions.

* Probe path (reconciler.probeLoadedModels and health.doCheckAll) now
  pass the row's replica_index to RemoveNodeModel so an unreachable
  replica doesn't orphan healthy siblings.

* Admin override: PUT /api/nodes/:id/max-replicas-per-model sets a
  sticky override (preserved across worker re-registration). DELETE
  clears the override so the worker's flag applies again on next
  register. Required because Kong defaults the worker flag to 1, so
  every worker restart would have silently reverted the UI value.

* React UI: always-visible slot badge on the node row (muted at default
  1, accented when >1); inline editor in the expanded drawer with
  pencil-to-edit, Save/Cancel, Esc/Enter, "(override)" indicator when
  the value is admin-set, and a "Reset" button to hand control back to
  the worker. Soft confirm when shrinking the cap below the count of
  loaded replicas. Scheduling rules table gets an "Unsatisfiable until
  HH:MM" status badge surfacing the cooldown.

* node.replica-slots filtered out of the labels strip on the row to
  avoid duplicating the slot badge.

23 new Ginkgo specs (registry, reconciler, inflight, health) cover:
multi-replica row independence, RemoveNodeModel of one replica
preserving siblings, NextFreeReplicaIndex slot allocation including
ErrNoFreeSlot, capacity-gated scale-up with circuit breaker tripping
and recovery on Register, scheduleDownIdle ordering, ClusterCapacity
math, ReserveVRAM admission gating, Heartbeat reset, override survival
across worker re-registration, and ResetMaxReplicasPerModel handing
control back. Plus 8 stdlib tests for the worker processKey / CLI /
auto-label.

Closes the flap reproduced on Qwen3.6-35B against the nvidia-thor
worker (single 128 GiB node, MinReplicas=2): the reconciler now caps
the scale-up at the cluster's actual capacity instead of looping.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Read] [Edit] [Bash] [Skill:critique] [Skill:audit] [Skill:polish] [Skill:golang-testing]

* refactor(react-ui/nodes): tighten capacity editor copy + adopt ActionMenu for row actions

* Capacity editor hint trimmed from operator-doc-style ("Sourced from
  the worker's `--max-replicas-per-model` flag. Changing it here makes it
  a sticky admin override that survives worker restarts." → "Saved
  values stick across worker restarts.") and the override-state copy
  similarly compressed. The full mechanic is no longer needed in the UI
  — the override pill carries the meaning and the docs cover the rest.

* Node row actions migrated from an inline cluster of icon buttons
  (Drain / Resume / Trash) to the kebab ActionMenu used by /manage for
  per-row model actions, so dense Nodes tables stay clean. Approve
  stays as a prominent primary button — it's a stateful admission gate,
  not a routine action, and elevating it matches how /manage surfaces
  install-time decisions outside the menu.

* The expanded drawer's Labels section now filters node.replica-slots
  out of the editable label list. The label is owned by the Capacity
  editor above; surfacing it again as an editable label invited
  confusion (the Capacity save would clobber any direct edit).

Both backend and agent workers benefit — they share the row rendering
path, so the action menu and label filter apply to both.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp] [Skill:critique] [Skill:audit] [Skill:polish]

* fix(react-ui/nodes): suppress slot badge on agent workers

Agent workers don't load models, so the per-node replica capacity is
inapplicable to them. Showing "1× slots" on agent rows was a tiny
inconsistency from the unified rendering path — gate the badge on
node_type !== 'agent' so it only appears on backend workers.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp]

* refactor(react-ui/nodes): distill expanded drawer + restyle scheduling form

The expanded node drawer used to stack five panels — slot badge,
filled capacity box, Loaded Models h4+empty-state, Installed Backends
h4+empty-state, Labels h4+chips+form — making routine inspections feel
like a control panel. The scheduling rule form wrapped its mode toggle
as two 50%-width filled buttons that competed visually with the actual
primary action.

* Drawer: collapse three rarely-touched config zones (Capacity,
  Backends, Labels) into one `<details>` "Manage" disclosure (closed by
  default) with small uppercase eyebrow labels for each zone instead of
  parallel h4 sub-headings. Loaded Models stays as the at-a-glance
  headline with a single-line empty hint instead of a boxed empty state.
  CapacityEditor renders flat (no filled background) — the Manage
  disclosure provides framing.

* Scheduling form: replace the chunky 50%-width button-tabs with the
  project's existing `.segmented` control (icon + label, sized to
  content). Mode hint becomes a single tied line below. Fields stack
  vertically with helper text under inputs and a hairline divider above
  the right-aligned Save / Cancel.

The empty drawer collapses from ~5 stacked sections (~280px tall) to
two lines (~80px). The scheduling form now reads as a designed dialog
instead of raw building blocks. Both surfaces now match the typographic
density and weight of the rest of the admin pages.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp] [Skill:distill] [Skill:audit] [Skill:polish]

* feat(react-ui/nodes): replace scheduling form's model picker with searchable combobox

The native <select> made operators scroll through every gallery entry to
find a model name. The project already has SearchableModelSelect (used
in Studio/Talk/etc.) which combines free-text search with the gallery
list and accepts typed model names that aren't installed yet — useful
for pre-staging a scheduling rule before the node it'll run on has
finished bootstrapping.

Also drops the now-unused useModels import (the combobox manages the
gallery hook internally).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit]

* refactor(react-ui/nodes): consolidate key/value chip editor + add replica preset chips

The Nodes page was rendering the same key=value chip pattern in two
places with subtly different markup: the Labels editor in the expanded
drawer and (post-distill) the Node Selector input in the scheduling
form. The form's input was also a comma-separated string that operators
were getting wrong.

* Extract <KeyValueChips> as a fully controlled chip-builder. Parent
  owns the map and decides what onAdd/onRemove does — form state for the
  scheduling form, API calls for the live drawer Labels editor. Same
  visuals everywhere; one component to change when polish needs apply.

* Replace the comma-separated Node Selector text input with KeyValueChips.
  Operators were copying syntax from docs and missing commas; the chip
  vocabulary makes the key=value structure self-documenting.

* Add <ReplicaInput>: numeric input + quick-pick preset chips for Min/Max
  replicas. Picked over a slider because replica counts are exact specs
  derived from VRAM math (operator decision, not a fuzzy estimate). The
  chips give one-click access to common values (1/2/3/4 for Min,
  0=no-limit/2/4/8 for Max) without the slider's special-value problem
  (MaxReplicas=0 is categorical, not a position on a continuum).

* Drop the now-unused labelInputs state in the Nodes page (the inline
  label editor's per-node draft state lived there and is now owned by
  KeyValueChips).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [Skill:distill]

* test: fix CI fallout from multi-replica refactor (e2e/distributed + playwright)

Two breakages caught by CI that didn't surface in the local run:

* tests/e2e/distributed/*.go — multiple files used the pre-PR2 registry
  signatures for SetNodeModel / IncrementInFlight / DecrementInFlight /
  RemoveNodeModel / TouchNodeModel / GetNodeModel / SetNodeModelLoadInfo
  and one stale adapter.InstallBackend call in node_lifecycle_test.go.
  All updated to pass replicaIndex=0 — these tests don't exercise
  multi-replica behavior, they just need to compile against the new
  signatures. The chip-builder tests in core/services/nodes/ already
  cover the multi-replica logic.

* core/http/react-ui/e2e/nodes-per-node-backend-actions.spec.js — the
  drawer's distill refactor moved Backends inside a "Manage" <details>
  disclosure that's collapsed by default. The test helper expanded the
  node row but never opened Manage, so the per-node backend table was
  never in the DOM. Helper now clicks `.node-manage > summary` after
  expanding the row.

All 100 playwright tests pass locally; tests/e2e/distributed compiles
clean.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [Bash]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-27 21:20:05 +02:00
Ettore Di Giacinto
f4036fa83f ci(python-backends): add weekly DEPS_REFRESH cache-buster
The shared backend/Dockerfile.python ends in:
    RUN cd /${BACKEND} && PORTABLE_PYTHON=true make
which `pip install`s each backend's requirements*.txt. A scan of all 34
Python backends shows every single one ships at least some unpinned deps
(torch, transformers, vllm, diffusers, ...). With the registry cache now
enabled, that `make` layer's BuildKit hash depends only on Dockerfile
instructions + COPYed source — not on what pip resolves at runtime — so
a warm cache would freeze upstream versions indefinitely.

DEPS_REFRESH is an ARG declared right before that RUN. backend_build.yml
computes `date -u +%Y-W%V` (ISO week, e.g. `2026-W17`) and passes it as
a build-arg, so the install layer invalidates at most once per week and
re-resolves PyPI / nightly indexes. Within a week, builds stay warm.

Only Dockerfile.python is affected: Go (go.sum) and Rust (Cargo.lock)
already lock their deps, and the C++ backends pull gRPC at a pinned tag
and llama.cpp at a pinned commit.

Add .agents/ci-caching.md documenting the cache layout
(quay.io/go-skynet/ci-cache:cache<tag-suffix>), read/write semantics
(master writes, PRs read-only), DEPS_REFRESH semantics, and how to
manually evict tags. Index it from AGENTS.md (CLAUDE.md is a symlink).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7-1m
2026-04-27 14:21:11 +00:00
Ettore Di Giacinto
3810fe1a1e fix(distributed): worker container healthcheck always unhealthy
The Dockerfile's HEALTHCHECK probes http://localhost:8080/readyz, which
is the OpenAI API server port. When the same image runs as a worker, it
listens on the gRPC base port (50051) and an HTTP file transfer server
on port-1 (50050) — nothing on 8080 — so docker always reports the
container as unhealthy.

Add unauthenticated /readyz and /healthz endpoints to the worker's HTTP
file transfer server, and override HEALTHCHECK_ENDPOINT for worker-1 in
the distributed compose file. Disable the healthcheck for agent-worker
since it is NATS-only and exposes no HTTP server.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-27 13:51:57 +00:00
Ettore Di Giacinto
bdfa5e934a ci: switch image/backend build cache to a dedicated registry image
- Switch cache-from/cache-to in backend_build.yml and image_build.yml
  from the unused gha cache to type=registry pointing at
  quay.io/go-skynet/ci-cache:cache<tag-suffix>, mode=max with
  ignore-error=true. Master/tag builds populate their own
  per-matrix-entry cache; PR builds read-only.
- Drop the broken generate_grpc_cache.yaml cron. It targeted a `grpc`
  Dockerfile stage that was removed by b1fc5acd in July 2025, has been
  failing every night since, and never populated the gha cache. The new
  registry-cache scheme is self-warming, so no separate populator is
  needed.
- Remove the dead GRPC_VERSION / GRPC_BASE_IMAGE / GRPC_MAKEFLAGS
  build-args from image_build.yml and the orphan ARG GRPC_BASE_IMAGE in
  the root Dockerfile (the root Dockerfile no longer compiles gRPC; the
  source build now lives in backend/Dockerfile.{llama-cpp,
  ik-llama-cpp, turboquant} only and uses its own ARG defaults).
- Drop the unused grpc-base-image input from image_build.yml plus the
  matrix passthroughs in image.yml / image-pr.yml.
- Drop the unused GRPC_VERSION env in test.yml.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7-1m
2026-04-27 13:13:04 +00:00
Richard Palethorpe
deca6dbdad feat: Log backend exit code (#9581)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-27 14:19:18 +02:00
Ettore Di Giacinto
60549a8a60 feat(react-ui): page-width archetype system + mobile/tablet nav polish
Replace the universal max-width:1200px cap on .page with a four-tier
archetype system (narrow 760, medium 1080, default 1600, wide unbounded)
selected per page based on what its UX actually wants. Data/table pages
fill ultrawide displays; forms cap at reading width; tabbed feature
surfaces breathe.

Mobile/tablet:
- New 640/1024 breakpoint split. Tablets (640-1023) get a persistent
  52px icon rail; below 640 keeps the slide-off drawer.
- Drawer polish: body-scroll lock, Escape to close, focus moves into
  the drawer on open and back to the hamburger on close, aria-hidden
  + inert on main while open.
- Mobile top bar carries hamburger + theme toggle + account avatar
  (44x44 touch targets) so theme/account aren't trapped in the drawer.
- Page-level reflow on phones: page-header column-stacks, filter chips
  scroll horizontally, tables go edge-to-edge, OperationsBar overflows
  rather than wrapping. Honors prefers-reduced-motion.

Manage > Models: drop the toggle column; Enable/Disable joins the
per-row Actions menu alongside Stop/Pin/Edit/Logs/Delete for
consistency with the other action verbs.

Page-width tokens live in theme.css so future tuning is one line.
Removes 7 inline maxWidth workarounds from page roots.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude Code:claude-opus-4-7 [Edit] [Bash]
2026-04-27 11:51:29 +00:00
Ettore Di Giacinto
54728e292f feat(react-ui): split Manage backends toggle into Variants and Development
Meta backends are now always shown — they're the entries operators
configure against — and two independent toggles govern the noise around
them. "Variants" hides platform-specific concrete builds that a meta
backend aliases on the host (e.g. llama-cpp-cuda12-12.4). "Development"
hides pre-release `-development` builds. Each toggle shows the count of
items currently hidden in its category. The legacy `bm` URL flag is
honored on read so existing deep-links resolve to the same view they
used to.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-27 08:23:53 +00:00
Tai An
86fd62233f fix(gallery): correct Qwen3.5 typo in qwen3.5-27b-claude-4.6 model override (closes #9362) (#9580)
The overrides.parameters.model field referenced 'Qwen3.-27B-Claude-...' (missing the '5'), so model loads failed because the configured filename did not match the file actually downloaded by the entry's files: list ('Qwen3.5-27B-Claude-...').

Aligns the override filename with the files: entries and with the upstream HF repo (mradermacher/Qwen3.5-27B-...).
2026-04-27 09:24:00 +02:00
Alex Brick
41ed8ced70 [intel GPU support] Use latest oneapi-basekit image for Intel images to support b70 (in more places this time) (#9578)
Update additional intel base images
2026-04-27 09:18:57 +02:00
LocalAI [bot]
05e94bd9e7 chore: ⬆️ Update ggml-org/llama.cpp to f53577432541bb9edc1588c4ef45c66bf07e4468 (#9577)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-27 08:57:24 +02:00
Ettore Di Giacinto
8d124d080f feat(gallery): add whisper-development umbrella stanza
Mirrors the whisper capabilities map with -development variants so
clients can pull the master-tagged whisper.cpp backend via a single
platform-resolved name, matching the existing faster-whisper-development
and whisperx-development entries.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-26 23:04:27 +00:00
Ettore Di Giacinto
2da1a4d230 feat(distributed): per-node backend installation from the gallery
In distributed mode the Backends gallery used to fan every install out to
every worker — fine for auto-resolving (meta) backends like llama-cpp where
each node picks its own variant, but wrong for hardware-specific builds
like cpu-llama-cpp that would silently land on every GPU node.

Adds a node-targeted install path through the existing
POST /api/nodes/:id/backends/install plumbing, with two entry points:

- Backends gallery row gets a split-button in distributed mode. Auto-
  resolving keeps "Install on all nodes" as the primary; chevron menu
  opens the picker. Hardware-specific routes the primary directly to the
  picker — no fan-out path on the row.
- Nodes-page drawer gets a "+ Add backend" button that navigates to
  /app/backends?target=<node-id>; the gallery scopes itself to that node
  (banner, single per-row install button, Reinstall/Remove for already-
  installed). One gallery, two scopes — no second UI to maintain.

The picker (new NodeInstallPicker) shows a 3-state suitability column
(Compatible / Override / Installed), an auto-expanding variant override
disclosure that fires when selected nodes have no working GPU, parallel
per-node installs with inline status and Retry-failed-nodes, and a
mismatch confirm that names the consequence on the button itself.

A 409 fan-out guard on /api/backends/apply protects CLI/Terraform/script
users from the same footgun: hardware-specific installs in distributed
mode now return code "concrete_backend_requires_target" with a human-
readable error and a meta_alternative pointer.

The gallery list payload now surfaces capabilities, metaBackendFor and
per-row nodes (NodeBackendRef) so the picker and the new Nodes column
have everything they need without re-walking the gallery client-side.

GODEBUG=netdns=go is set on the compose services because the cgo DNS
resolver follows the container's nsswitch.conf to host systemd-resolved
(127.0.0.53), unreachable from inside the container; the pure-Go
resolver reads /etc/resolv.conf directly and uses Docker's embedded DNS.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude Code:claude-opus-4-7[1m] [Edit] [Bash] [Read] [Write]
2026-04-26 22:05:18 +00:00
Ettore Di Giacinto
988430c850 test(react-ui): drive Manage page Backend logs link via the new kebab menu
Manage page row actions moved into ActionMenu in b336d9c6, so the
inline `<a title="Backend logs">` the e2e specs were asserting on no
longer exists. Open the row's kebab and assert against the menuitem.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7
2026-04-26 20:51:01 +00:00
Ettore Di Giacinto
b336d9c626 feat(react-ui): polish Manage page with kebab menus and gallery rows
Bring the System / Manage page up to the visual standard of the Install
gallery so installed models and backends stop reading like a debug dump.

- Unified ResourceRow anatomy (icon, name+description, badges, status,
  expandable detail) shared across both tabs.
- Gallery enrichment cross-references installed names against the gallery
  list endpoints to surface icons, descriptions, license, tags, and links
  with a graceful "no description" fallback for custom imports.
- Header summary with four StatCards (Models / Backends / Running /
  Updates) — clickable to switch tab + pre-set filter.
- Backends meta + development entries hidden by default; "Show meta &
  development" paired toggle in the FilterBar with hidden-count hint.
- Kebab (three-dot) ActionMenu replaces the inline button cluster on
  every row; restrained until hover, keyboard-navigable, danger items
  separated by a divider.
- Backend "Version" cell now falls back to short digest, OCI tag, or
  ocifile basename when no semver is set, instead of showing "—" for
  every OCI install. Detail panel exposes full Source URI + Digest.
- Drop redundant column headers ("Actions", "On") — kebabs and toggles
  carry their own affordance; screen readers still get a label.
- Inline System / User / Meta / Dev badges next to the backend name so
  the dedicated Type column doesn't reserve space for "USER" repeated.
- Tightened the spacing between the System Resources card and the
  StatCards so they no longer crowd the RAM bar.

Extracted StatCard and GalleryLoader from Nodes.jsx and Models.jsx into
shared components so the visual language is one source of truth.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude Code:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
2026-04-26 20:33:49 +00:00
Ettore Di Giacinto
f384c64a91 fix(model-loader): also skip .ckpt, .zip, and .tag files when scanning models
The local model directory scan treats every non-skipped file as a model
config candidate. Sidecar artifacts that ship alongside checkpoints
(checkpoint blobs, downloaded archives, ggml-style tag files) were
slipping through and showing up as bogus models in the listing. Add
their extensions to the suffix-skip list.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
2026-04-26 19:37:53 +00:00
Ettore Di Giacinto
e9d8e92988 fix(react-ui): don't yank chat scroll to bottom while user is reading
The chat and agent-chat pages auto-scrolled to the bottom on every
streamed token. If the user scrolled up to re-read part of a response,
the next chunk pulled them back down — making long replies unreadable
while streaming.

Track a stickToBottomRef on each scroll event: if the user is within
80px of the bottom we keep auto-scrolling, otherwise we leave them
where they are. On chat switch we snap back to the bottom and re-pin.

Same fix applied to both Chat.jsx and AgentChat.jsx since they share
the same streaming pattern.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
2026-04-26 19:35:39 +00:00
Ettore Di Giacinto
5b0196c7d0 fix(whisper): scrub invalid UTF-8 from segment text before protobuf marshal
whisper.cpp can emit bytes that are not valid UTF-8 — typically a
multibyte codepoint split across token boundaries. protobuf string
fields reject those at marshal time, which would surface as a transcribe
failure. Run strings.ToValidUTF8 on the segment text before it leaves
the cgo boundary so the bad byte gets replaced with U+FFFD.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
2026-04-26 19:35:39 +00:00
Ettore Di Giacinto
c8d63a1003 fix(react-ui): stop Manage page from blanking on auto-refresh; show real model use cases
- useModels.refetch now runs silently — distributed-mode 10s auto-refresh
  no longer flips loading=true and replaces the table with a spinner card.
- Manage Use Cases column derives badges from each model's actual
  capabilities (Chat / Image / TTS / Embeddings / etc.) instead of
  hardcoding a "Chat" link for every row.
- FilterBar right slot is right-aligned via margin-left:auto so the
  Update button lives at the end of the row, not next to the chips.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
2026-04-26 19:35:39 +00:00
LocalAI [bot]
d9cb0d6133 chore: ⬆️ Update ggml-org/llama.cpp to dcad77cc3b0865153f486327064fb0320a57a476 (#9572)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-26 12:38:35 +02:00
LocalAI [bot]
f5c268deac chore: ⬆️ Update TheTom/llama-cpp-turboquant to 11a241d0db78a68e0a5b99fe6f36de6683100f6a (#9571)
⬆️ Update TheTom/llama-cpp-turboquant

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-26 12:38:25 +02:00
Tai An
8931a2ad31 fix(gallery): normalize inconsistent tag casing/plurals across gallery models (#9574)
- embeddings → embedding (6 models): aligns with the WebUI filter button
  defined in core/http/views/models.html ({ term: 'embedding', ... }), so
  models like nomic-embed-text-v1.5 now appear under the Embedding filter
- TTS → tts (5 models), ASR → asr (2 models): lowercase, per existing
  convention used by 161+ models
- CPU/Cpu → cpu (17 models), GPU → gpu (17 models): lowercase, per existing
  convention used by 666+ models
- dedupe duplicate tag entries on 3 models that already had repeated tags
  (gpt-oss-20b had gguf x2; arcee-ai/AFM-4.5B had gpu x2; one Qwen model
  had default x2)

Closes #9247
2026-04-26 08:33:38 +02:00
Ettore Di Giacinto
e16e758dff ci(backends): build cpu-whisperx and cpu-faster-whisper for linux/arm64 (#9573)
Extend the existing CPU build matrix entries to produce a multi-arch
manifest (linux/amd64,linux/arm64) at the same image tags. arm64
Linux hosts without an NVIDIA GPU report the "default" capability,
which already maps to cpu-whisperx / cpu-faster-whisper in
backend/index.yaml -- so the manifest list lets Docker pull the right
variant without any gallery changes.

Both stacks install cleanly under aarch64: torch (2.4.1/2.8.0),
faster-whisper, ctranslate2, whisperx, opencv-python and the
remaining deps all ship manylinux2014_aarch64 wheels, so no source
builds run under QEMU emulation.

Follows the same pattern already used by cpu-llama-cpp-quantization.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-26 08:30:03 +02:00
LocalAI [bot]
1c45227346 chore: ⬆️ Update ikawrakow/ik_llama.cpp to 3a945af45d45936341a45bbf7deda56776a4af26 (#9570)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-26 08:26:37 +02:00
Ettore Di Giacinto
fbe4f0a99b fix(docs): replace Docsy alert shortcode with Relearn notice
The docs site uses the hugo-theme-relearn theme, which provides
`notice` instead of Docsy's `alert`. The face-recognition,
voice-recognition, and stores feature pages used `{{% alert %}}`,
breaking `hugo build` with "template for shortcode \"alert\" not
found".

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-25 21:04:31 +00:00
Ettore Di Giacinto
d733c9cd13 fix(mlx-vlm): pin upstream to v0.4.4 to unblock CUDA builds (#9568)
Blaizzy/mlx-vlm git HEAD bumped its constraint to mlx>=0.31.2, but
mlx-cuda-12 and mlx-cuda-13 are only published up to 0.31.1 on PyPI.
Since mlx[cudaXX]==0.31.2 forces a sibling wheel that doesn't exist,
pip backtracks through every older mlx[cudaXX], none of which satisfy
mlx>=0.31.2, producing ResolutionImpossible.

Pin all variants to the v0.4.4 tag (mlx>=0.30.0), which resolves
cleanly against mlx[cuda13]==0.31.1. cpu/mps weren't broken yet but
are pinned for consistency.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-25 22:06:01 +02:00
Ettore Di Giacinto
703b4fcae8 Change cron schedule to run every 12 hours
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-25 18:38:28 +02:00
Richard Palethorpe
73aacad2f9 fix(vllm): drop flash-attn wheel to avoid torch 2.10 ABI mismatch (#9557)
The pinned flash-attn 2.8.3+cu12torch2.7 wheel breaks at import time
once vllm 0.19.1 upgrades torch to its hard-pinned 2.10.0:

  ImportError: .../flash_attn_2_cuda...so: undefined symbol:
  _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib

That C10 CUDA symbol is libtorch-version-specific. Dao-AILab has not yet
published flash-attn wheels for torch 2.10 -- the latest release (2.8.3)
tops out at torch 2.8 -- so any wheel pinned here is silently ABI-broken
the moment vllm completes its install.

vllm 0.19.1 lists flashinfer-python==0.6.6 as a hard dep, which already
covers the attention path. The only other use of flash-attn in vllm is
the rotary apply_rotary import in
vllm/model_executor/layers/rotary_embedding/common.py, which is guarded
by find_spec("flash_attn") and falls back cleanly when absent.

Also unpin torch in requirements-cublas12.txt: the 2.7.0 pin only
existed to give the flash-attn wheel a matching torch to link against.
With flash-attn gone, vllm's own torch==2.10.0 dep is the binding
constraint regardless of what we put here.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-25 15:38:13 +00:00
LocalAI [bot]
806ea24ff4 chore: ⬆️ Update TheTom/llama-cpp-turboquant to 67559e580b10e4e47e9a6fd6218873997976886d (#9497)
⬆️ Update TheTom/llama-cpp-turboquant

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-25 14:03:46 +02:00
LocalAI [bot]
385de3705e chore(model gallery): 🤖 add 1 new models via gallery agent (#9558)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-25 14:03:15 +02:00
Ettore Di Giacinto
21eace40ec feat(llama-cpp): expose split_mode option for multi-GPU placement (#9560)
Adds split_mode (alias sm) to the llama.cpp backend options allowlist,
accepting none|layer|row|tensor. The tensor value targets the experimental
backend-agnostic tensor parallelism from ggml-org/llama.cpp#19378 and
requires a llama.cpp build that includes that PR, FlashAttention enabled,
KV-cache quantization disabled, and a manually set context size.


Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-25 14:02:57 +02:00
Ettore Di Giacinto
24505e57f5 feat(backends): add CUDA 13 + L4T arm64 CUDA 13 variants for vllm/vllm-omni/sglang (#9553)
* feat(backends): add CUDA 13 + L4T arm64 CUDA 13 variants for vllm/vllm-omni/sglang

Adds new build profiles mirroring the diffusers/ace-step pattern so vLLM
serving (and SGLang on arm64) can be deployed on CUDA 13 hosts and
JetPack 7 boards:

- vllm: cublas13 (PyPI cu130 channel) + l4t13 (jetson-ai-lab SBSA cu130
  prebuilt vllm + flash-attn).
- vllm-omni: cublas13 + l4t13. Floats vllm version on cu13 since vllm
  0.19+ ships cu130 wheels by default and vllm-omni tracks vllm master;
  cu12 path keeps the 0.14.0 pin to avoid disturbing existing images.
- sglang: l4t13 arm64 only — uses the prebuilt sglang wheel from the
  jetson-ai-lab SBSA cu130 index, so no source build is needed.
  Cublas13 sglang on x86_64 is intentionally deferred.

CI matrix gains five new images (-gpu-nvidia-cuda-13-vllm{,-omni},
-nvidia-l4t-cuda-13-arm64-{vllm,vllm-omni,sglang}); backend/index.yaml
gains the matching capability keys (nvidia-cuda-13, nvidia-l4t-cuda-13)
and latest/development merge entries.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]

* fix(backends): use unsafe-best-match index strategy on l4t13 builds

The jetson-ai-lab SBSA cu130 index lists transitive deps (decord, etc.)
at limited versions / older Python ABIs. uv defaults to the first index
that contains a package and refuses to fall through to PyPI, so sglang
l4t13 build fails resolving decord. Mirror the existing cpu sglang
profile by setting --index-strategy=unsafe-best-match on l4t13 across
the three backends, and apply it to the explicit vllm install line in
vllm-omni's install.sh (which doesn't honor EXTRA_PIP_INSTALL_FLAGS).

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash]

* fix(sglang): drop [all] extras on l4t13, floor version at 0.5.0

The [all] extra brings in outlines→decord, and decord has no aarch64
cp312 wheel on PyPI nor the jetson-ai-lab index (only legacy cp35-cp37
tags). With unsafe-best-match enabled, uv backtracked through sglang
versions trying to satisfy decord and silently landed on
sglang==0.1.16, an ancient version with an entirely different dep
tree (cloudpickle/outlines 0.0.44, etc.).

Drop [all] so decord is no longer required, and floor sglang at 0.5.0
to prevent any future resolver misfire from degrading the version
again.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-25 12:26:29 +02:00
LocalAI [bot]
d09706dc60 chore(model gallery): 🤖 add 1 new models via gallery agent (#9555)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-25 09:00:37 +02:00
LocalAI [bot]
08e393f7db chore: ⬆️ Update ikawrakow/ik_llama.cpp to cb58a561f0c49f68b6d125cdfda037ed80433821 (#9549)
⬆️ Update ikawrakow/ik_llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-25 08:59:48 +02:00
LocalAI [bot]
47cc3dc8d7 chore: ⬆️ Update ggml-org/llama.cpp to 361fe72acb7b9bd79059cc177cbeda99b35b5db9 (#9548)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-25 08:58:27 +02:00
Ettore Di Giacinto
83b384de97 feat: surface distributed backend management errors (#9552)
* fix(distributed): surface per-node backend op errors to OpStatus

DistributedBackendManager.{Install,Upgrade,Delete}Backend discarded the
per-node BackendOpResult from enqueueAndDrainBackendOp with `_, err :=`.
When workers replied Success=false (e.g. an OCI image with no arm64
variant on a Jetson host), the per-node Error string was recorded in
result.Nodes[].Error but never reached the toplevel return value, so
OpStatus.Error stayed empty and the UI reported the install as
"completed" while the backend was nowhere on the cluster.

Add BackendOpResult.Err() that aggregates per-node Status=="error"
entries into a single error. Queued nodes (waiting for reconciler retry)
are deliberately not treated as failures. Wire the three callers and
DeleteBackendDetailed to call result.Err() so reply.Success=false
finally reaches OpStatus.Error → /api/backends/job/:uid → the UI.

The Delete closures had a related bug: they discarded the reply with
`_` and only checked the NATS round-trip error, so reply.Success=false
was a silent success even with the new aggregation. Check both.

Standalone mode (LocalBackendManager) already surfaces gallery errors
correctly through the same OpStatus.Error path; no change needed there.

Tests: 9 new Ginkgo specs covering all-success / all-fail with distinct
errors / mixed / all-queued / no-nodes for Install, Upgrade, Delete.

Assisted-by: Claude:claude-opus-4-7 [Bash] [Edit] [Read] [Write]

* feat(react-ui): per-node backend delete + clearer upgrade affordance

The Nodes page exposed a per-node "reinstall" button (fa-sync-alt,
tooltip "Reinstall backend") but no per-node delete, even though the
Go side has had POST /api/nodes/:id/backends/delete →
RemoteUnloaderAdapter.DeleteBackend → NATS-to-specific-node wired up
for a while. Sync icons read as "refresh data" — the action is
functionally an upgrade (re-pulls the gallery image), so the affordance
was misleading.

Per-node backend row now renders two icon buttons:

- Upgrade: btn-secondary btn-sm + fa-arrow-up, tooltip "Upgrade backend
  on this node". Names both action and scope to differentiate from the
  cluster-wide upgrade on the Backends page.
- Delete: btn-danger-ghost btn-sm + fa-trash, tooltip "Delete backend
  from this node". Matches the node-level destructive style at the row
  action column rather than the solid btn-danger of primary destructive
  pages, since this is a secondary action inside a busy row.

Delete goes through the existing ConfirmDialog (danger=true) with copy
that names the backend and the node explicitly — it's a non-recoverable
op on a specific scope. Reuses nodesApi.deleteBackend(id, backend) which
already existed in the API client.

Tests: 4 new Playwright specs covering upgrade clarity (icon + tooltip),
delete button presence, confirm dialog flow with POST body assertion,
and cancel-doesn't-POST.

Assisted-by: Claude:claude-opus-4-7 [Bash] [Edit] [Read] [Write]
2026-04-25 08:57:59 +02:00
Ettore Di Giacinto
487e3fd2a4 feat(react-ui): editorial refresh with Nord palette and polished primitives (#9550)
* feat(react-ui): editorial refresh with Nord palette and polished primitives

Replaces the cool gray-blue theme with a deep Nord-inspired palette:
frost-cyan accent (#88c0d0) on deep blue-black surfaces (#13171f /
#1a1f2a / #242a36), snow-storm text scale, aurora status colours.

- Typography: Geist Variable + Geist Mono Variable (Google Fonts) with
  ss01/ss03/cv11 stylistic alternates; strengthened h1-h6 hierarchy;
  editorial negative tracking.
- Primitives: buttons gain depth (inset highlight + hover lift +
  brightness filter); inputs become sunken wells with sage-swap-to-frost
  focus rings; cards hover-lift and gain an .card--accent left-rail
  variant; badges become mono caps rectangles with tabular-nums.
- Chrome: sidebar active state is now an inset left rail + tint
  (no border-left); modals get popIn animation and proper shadow lift;
  toasts carry an inset accent bar + slide-in instead of tinted fills;
  operations bar breathes on active installs.
- Empty states: editorial pattern (eyebrow rule, large mono title,
  52ch lede) that inherits gracefully even without page JSX edits.
- Chat: assistant bubbles drop the gray-nested-in-gray card for a
  transparent pull-quote with a left border; user bubbles soften from
  loud accent fill to a subtle frost tint.
- Motion: custom spring easing cubic-bezier(0.22,1,0.36,1), 180ms
  standard; breathing/pulse/popIn keyframes; global prefers-reduced-
  motion honoring.
- Radii tightened to 3/5/8/10px; warm-shadow tokens redone for cool
  depth; ::selection, :focus-visible, kbd globals added.
- Migrated hardcoded 'JetBrains Mono' CSS literals to var(--font-mono)
  so the Geist Mono swap lands everywhere.

Scope is intentionally tokens + primitives only. Page JSX and the
~1,800 inline style={{…}} instances are untouched and flagged as
follow-ups.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write]

* feat(react-ui): complete-coverage pass — migrate inline styles to tokens

Follows up the editorial/Nord token refresh with a mechanical sweep of
page JSX and shared components so nothing bypasses the design system.

- Font family: replaced 80+ 'JetBrains Mono' / 'Space Grotesk' inline
  literals (and the string-CSS variants in CollectionDetails and
  AgentStatus) with var(--font-mono) / var(--font-sans). SVG <text>
  nodes that used the attribute form were switched to style={{ }} so
  the CSS variable resolves.
- Radii: every unquoted numeric borderRadius (2/3/4/10) is now a
  var(--radius-*) token; 50% and 999px kept as computed shapes.
- Spacing: clean-token gaps and margins (4/8/16px) moved to
  var(--spacing-xs/sm/md); padding: '4px 8px' and '8px 16px' lifted
  into token pairs. Micro-values (2/6/10/12px) left inline where no
  token maps cleanly.
- Colors: Talk.jsx button/canvas-surface hardcodes moved to
  var(--color-*); FineTune.jsx chart series colours now use the
  --color-data-* Nord palette (cyan/red/purple/orange instead of
  tailwind hex); AgentStatus tool-call icon and error tag hex swapped
  for var(--color-warning) / var(--color-text-inverse).
- CodeMirror editor (utils/cmTheme.js): both themes rebased on Nord —
  polar-night surfaces and aurora syntax highlighting (dark), snow-
  storm surfaces with darkened aurora (light). Caret/selection/active
  line/search now frost-cyan tinted instead of legacy indigo/purple.

Legitimately dynamic styles (computed widths, per-row colours, canvas
2D context fill/stroke for waveform and spectrogram drawing) remain
inline — they can't be expressed as CSS tokens.

29 files, +237/-237 — identity preserved, semantics re-anchored to
the token system.

Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write]
2026-04-24 23:35:59 +02:00
dependabot[bot]
9ab3496de2 chore(deps): bump rustls-webpki from 0.103.10 to 0.103.13 in /backend/rust/kokoros in the cargo group across 1 directory (#9546)
chore(deps): bump rustls-webpki

Bumps the cargo group with 1 update in the /backend/rust/kokoros directory: [rustls-webpki](https://github.com/rustls/webpki).


Updates `rustls-webpki` from 0.103.10 to 0.103.13
- [Release notes](https://github.com/rustls/webpki/releases)
- [Commits](https://github.com/rustls/webpki/compare/v/0.103.10...v/0.103.13)

---
updated-dependencies:
- dependency-name: rustls-webpki
  dependency-version: 0.103.13
  dependency-type: indirect
  dependency-group: cargo
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-24 22:02:58 +02:00
dependabot[bot]
c4511be33a chore(deps): bump postcss from 8.5.8 to 8.5.10 in /core/http/react-ui in the npm_and_yarn group across 1 directory (#9544)
chore(deps): bump postcss

Bumps the npm_and_yarn group with 1 update in the /core/http/react-ui directory: [postcss](https://github.com/postcss/postcss).


Updates `postcss` from 8.5.8 to 8.5.10
- [Release notes](https://github.com/postcss/postcss/releases)
- [Changelog](https://github.com/postcss/postcss/blob/main/CHANGELOG.md)
- [Commits](https://github.com/postcss/postcss/compare/8.5.8...8.5.10)

---
updated-dependencies:
- dependency-name: postcss
  dependency-version: 8.5.10
  dependency-type: indirect
  dependency-group: npm_and_yarn
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-24 22:02:41 +02:00
Ettore Di Giacinto
551ebdb57a fix(distributed): correct VRAM/RAM reporting on NVIDIA unified-memory hosts (#9545)
Workers on NVIDIA unified-memory hardware (DGX Spark / GB10, Jetson AGX Thor,
Jetson Orin/Xavier/Nano) were reporting `available_vram=0` back to the frontend,
so the Nodes UI showed the node as fully used even when most of the unified
memory was actually free.

Three causes addressed:

* `isTegraDevice` only matched `/sys/devices/soc0/family == "Tegra"`. DGX Spark
  (SBSA) reports JEDEC codes there instead — `jep106:0426` for the NVIDIA
  manufacturer — so the Tegra/unified-memory fallback never ran. Renamed to
  `isNVIDIAIntegratedGPU` and extended to also match `jep106:0426[:*]` via
  `/sys/devices/soc0/soc_id`.

* The unified-iGPU code defaulted the device name to `"NVIDIA Jetson"` when
  `/proc/device-tree/model` was missing. That's what happens for Thor inside a
  docker container, and always on DGX Spark. New `nvidiaIntegratedGPUName`
  resolves via dt-model → `/sys/devices/soc0/machine` → `soc_id` lookup
  (`jep106:0426:8901` → `"NVIDIA GB10"`) so the Nodes UI labels the box
  correctly.

* Worker heartbeat sent `available_vram=0` (or total-as-available) when VRAM
  usage was momentarily unknown — e.g. when `nvidia-smi` intermittently failed
  with `waitid: no child processes` under containers without `--init`. Each
  such heartbeat overwrote the DB and made the UI flip to "fully used".
  `heartbeatBody` now omits `available_vram` in that case so the DB keeps its
  last good value.

Also updates the commented GPU blocks in both compose files with
`NVIDIA_DRIVER_CAPABILITIES=compute,utility`, `capabilities: [gpu, utility]`,
and `init: true`, and documents the requirement in the distributed-mode and
nvidia-l4t pages. Without `utility`, NVML/`nvidia-smi` are absent inside the
container, which is what put the DGX Spark worker into the buggy fallback in
the first place.

Detection verified on live hardware (dgx.casa / GB10 and 192.168.68.23 / Thor)
by running a cross-compiled probe of the new helpers on both host and inside
the worker container.

Assisted-by: Claude:opus-4.7 [Claude Code]
2026-04-24 22:02:23 +02:00
Andreas Egli
1d0de757c3 fix: add hipblaslt library (#9541)
Signed-off-by: Andreas Egli <github@kharan.ch>
2026-04-24 18:50:03 +02:00
Alex Brick
e5337039b0 [intel GPU support] Use latest oneapi-basekit image for Intel images to support b70 (#9543)
* Use latest oneapi-basekit image for Intel images

The current `localai/localai:master-gpu-intel` images don't work with the intel arc pro b70. Updating the base_image to 2025.3.2 fixes it.

Signed-off-by: Alex Brick <3220905+arbrick@users.noreply.github.com>

* Update github workflow base image

---------

Signed-off-by: Alex Brick <3220905+arbrick@users.noreply.github.com>
2026-04-24 18:29:10 +02:00
LocalAI [bot]
1c9592c77f chore: ⬆️ Update leejet/stable-diffusion.cpp to b8bdffc19962be7e5a84bfefeb2e31bd885b571a (#9521)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-24 15:15:15 +02:00
Richard Palethorpe
3db60b57e6 fix(realtime): consume ChatDeltas when C++ autoparser clears Response (#9538)
The llama.cpp C++-side chat autoparser clears Reply.Message and delivers
parsed content/reasoning/tool-calls via Reply.chat_deltas. chat.go handles
this (non-SSE path uses ToolCallsFromChatDeltas/ContentFromChatDeltas/
ReasoningFromChatDeltas), but realtime.go only read pred.Response, so any
model routed through the autoparser (Qwen2.5/3 and friends) produced a
silent reply: backend emitted N tokens, the session surface saw zero.

Mirror the non-SSE chat path in realtime's triggerResponse: when deltas
carry tool calls or content, use them directly; otherwise fall back to
the existing raw-text parsing.

Assisted-by: claude-opus-4-7-1M [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-24 14:41:38 +02:00
Richard Palethorpe
13734ae9fa feat: Add Sherpa ONNX backend for ASR and TTS (#8523)
feat(backend): Add Sherpa ONNX backend and Omnilingual ASR

Adds a new Go backend wrapping sherpa-onnx via purego (no cgo). Same
approach as opus/stablediffusion-ggml/whisper — a thin C shim
(csrc/shim.c + shim.h → libsherpa-shim.so) wraps the bits purego
can't reach directly: nested struct config writes, result-struct field
reads, and the streaming TTS callback trampoline. The Go side uses
opaque uintptr handles and purego.NewCallback for the TTS callback.

Supports:
- VAD via sherpa-onnx's Silero VAD
- Offline ASR: Whisper, Paraformer, SenseVoice, Omnilingual CTC
- Online/streaming ASR: zipformer transducer with endpoint detection
  (AudioTranscriptionStream emits delta events during decode)
- Offline TTS: VITS (LJS, etc.)
- Streaming TTS: sherpa-onnx's callback API → PCM chunks on a channel,
  prefixed by a streaming WAV header

Gallery entries: omnilingual-0.3b-ctc-q8-sherpa (1600-language offline
ASR), streaming-zipformer-en-sherpa (low-latency streaming ASR),
silero-vad-sherpa, vits-ljs-sherpa.

E2E coverage: tests/e2e-backends for offline + streaming ASR,
tests/e2e for the full realtime pipeline (VAD + STT + TTS).

Assisted-by: claude-opus-4-7-1M [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-24 14:40:06 +02:00
Ettore Di Giacinto
c0920f3273 fix(ik-llama-cpp): patch clip.cpp for new ggml_quantize_chunk signature (#9531)
Bumps ik_llama.cpp pin to 16996aeab7. Upstream 286ce32...16996ae adds a
trailing `const struct quantize_user_data *` parameter to
`ggml_quantize_chunk` (PR ikawrakow/ik_llama.cpp#1677) but leaves
`examples/llava/clip.cpp` unchanged because their build has moved to
`examples/mtmd/`. LocalAI's prepare.sh still copies from
`examples/llava/`, so the dead 7-arg call reaches the grpc-server
compile and fails. Patch the call site to pass `nullptr` for the new
param.

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash]
2026-04-24 13:07:26 +02:00
LocalAI [bot]
7c1934b183 chore: ⬆️ Update ggml-org/llama.cpp to 187a45637054881ecacf17f8e2f6f8f2ba7df1c7 (#9520)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-24 09:17:06 +02:00
Tai An
5e062b4d1f fix: use SetFunctionCallNameString when forcing a specific tool (3 sites) (#9526)
* fix(anthropic): use SetFunctionCallNameString for specific tool forcing

* fix(openai/realtime): use SetFunctionCallNameString for specific tool forcing

* fix(openresponses): use SetFunctionCallNameString for specific tool forcing
2026-04-24 09:06:42 +02:00
Ettore Di Giacinto
4906cbad04 feat: add biometrics UI (#9524)
* feat(react-ui): add Face & Voice Recognition pages

Expose the face and voice biometrics endpoints
(/v1/face/*, /v1/voice/*) through the React UI. Each page has four
tabs driving the six endpoints per modality: Analyze (demographics
with bounding boxes / waveform segments), Compare (verify with a
match gauge and live threshold slider), Enrollment (register /
identify / forget with a top-K matches view), Embedding (raw
vector inspector with sparkline + copy).

MediaInput supports file upload plus live capture: webcam
snap-to-canvas for face, MediaRecorder -> AudioContext ->
16-bit PCM mono WAV transcode for voice (libsndfile on the
backend only handles WAV/FLAC/OGG natively).

Sidebar gets a new Biometrics section feature-gated on
face_recognition / voice_recognition; routes are wrapped in
<RequireFeature>. No new dependencies -- Font Awesome icons
picked from the Free set.

Assisted-by: Claude:Opus 4.7

* fix(localai): accept data URI prefixes with codec/charset params

Browser MediaRecorder produces data URIs like
  data:audio/webm;codecs=opus;base64,...
so the pre-';base64,' section can carry multiple parameter
segments. The `^data:([^;]+);base64,` regex in pkg/utils/base64.go
and core/http/endpoints/localai/audio.go only matched exactly one
segment, so recordings straight from the React UI's live-capture
tab failed the strip and then tripped the base64 decoder on the
leading 'data:' literal, surfacing as
  "invalid audio base64: illegal base64 data at input byte 4"

Widened both regexes to `^data:[^,]+?;base64,` so any number of
';param=value' segments between the mime type and ';base64,' are
tolerated. Added a regression test covering the MediaRecorder
shape.

Assisted-by: Claude:Opus 4.7

* fix(insightface): scope pack ONNX loading to known manifests

LocalAI's gallery extracts buffalo_* zips flat into the models
directory, which inevitably mixes with ONNX files from other
backends (opencv face engine, MiniFASNet antispoof, WeSpeaker
voice embedding) and older buffalo pack installs. Feeding those
foreign files into insightface's model_zoo.get_model() blows up
inside the router -- it assumes a 4-D NCHW input and indexes
`input_shape[2]` on tensors that aren't shaped like a face model,
raising IndexError mid-load and leaving the backend unusable.

The router's dispatch isn't amenable to per-file try/except alone
(first-file-wins picks det_10g.onnx from buffalo_l even when the
user asked for buffalo_sc -- alphabetical order happens to favour
the wrong pack). Instead, ship an explicit manifest of the
upstream v0.7 pack contents and scope the glob to that when the
requested pack is known. The manifest is small and stable; future
packs can be added alongside or fall through to the tolerance
loop, which also swallows any remaining IndexError / ValueError
from foreign files with a clear `[insightface] skipped` stderr
line for diagnostics.

Assisted-by: Claude:Opus 4.7

* fix(speaker-recognition): extract FBank features for rank-3 ONNX encoders

Pre-exported speaker-encoder ONNX graphs come in two shapes:

  rank-2  [batch, samples]           -- some 3D-Speaker exports,
                                        take raw waveform directly.
  rank-3  [batch, frames, n_mels]    -- WeSpeaker and most Kaldi-
                                        lineage encoders, expect
                                        pre-computed Kaldi FBank.

OnnxDirectEngine unconditionally fed `audio.reshape(1, -1)` --
correct for rank-2, IndexError-on-input_shape[3] on rank-3, which
surfaced to the UI as
  "Invalid rank for input: feats Got: 2 Expected: 3"

Detect the input rank at session init and run Kaldi FBank
(80-dim, 25ms/10ms frames, dither=0.0, per-utterance CMN) before
the forward pass when rank>=3. All knobs are configurable via
backend options for encoders that deviate from defaults.

torchaudio.compliance.kaldi is already in the backend's
requirements (SpeechBrain pulls torchaudio in), so no new
dependency.

Assisted-by: Claude:Opus 4.7

* fix(biometrics): isolate face and voice vector stores

Face (ArcFace, 512-D) and voice (ECAPA-TDNN 192-D / WeSpeaker
256-D) biometric embeddings were colliding inside a single
in-memory local-store instance. Enrolling one after the other
failed with
  "Try to add key with length N when existing length is M"
because local-store correctly refuses to mix dimensions in one
keyspace.

The registries were constructed with `storeName=""`, which in
StoreBackend() is just a WithModel() call. But ModelLoader's
cache is keyed on `modelID`, not `model` -- so both registries
collapsed to the same `modelID=""` slot and reused the same
backend process despite looking isolated on paper.

Three complementary fixes:

  1. application.go -- give each registry a distinct default
     namespace ("localai-face-biometrics" /
     "localai-voice-biometrics"). The comment claimed
     isolation, now it's actually enforced.

  2. stores.go -- pass the storeName as both WithModelID and
     WithModel so the ModelLoader cache key separates
     namespaces and the loader spawns distinct processes.

  3. local-store/store.go -- drop the Load() `opts.Model != ""`
     guard. It was there to prevent generic model-loading loops
     from picking up local-store by accident, but that auto-load
     path is being retired; the guard now just blocks legitimate
     namespace isolation. opts.Model is treated as a tag; the
     per-tuple process isolation upstream handles discrimination.

Assisted-by: Claude:Opus 4.7

* fix(gallery): stale-file cleanup and upgrade-tmp directory safety

Two related robustness fixes for backend install/upgrade:

pkg/downloader/uri.go
  OCI downloads passed through
      if filepath.Ext(filePath) != "" ...
          filePath = filepath.Dir(filePath)
  which was intended to redirect file-shaped download targets
  into their parent directory for OCI extraction. The heuristic
  misfires on directory-shaped paths with a dot-suffix --
  gallery.UpgradeBackend uses
      tmpPath = "<backendsPath>/<name>.upgrade-tmp"
  and Go's filepath.Ext treats ".upgrade-tmp" as an extension.
  The rewrite landed the extraction at "<backendsPath>/", which
  then **overwrote the real install** (backends/<name>/) with a
  flat-layout file and left a stray run.sh at the top level. The
  tmp dir itself stayed empty, so the validation step that
  checked "<tmpPath>/run.sh" predictably failed with
      "upgrade validation failed: run.sh not found in new backend"
  Every manual upgrade silently corrupted the backends tree this
  way. Guard the rewrite behind "target isn't already an existing
  directory" -- InstallBackend / UpgradeBackend both pre-create
  the target as a directory, so they get the correct behaviour;
  existing file-path callers with a genuine dot-extension still
  get the parent redirect.

core/gallery/backends.go
  InstallBackend's MkdirAll returned ENOTDIR when something at
  the target path was already a file (legacy dev builds dropped
  golang backend binaries directly at `<backendsPath>/<name>`
  instead of nesting them under their own subdir). That
  permanently blocked reinstall and upgrade for anyone carrying
  that state, since every retry hit the same error. Detect a
  pre-existing non-directory, warn, and remove it before the
  MkdirAll so the fresh install can write the correct nested
  layout with metadata.json + run.sh.

Assisted-by: Claude:Opus 4.7

* fix(galleryop): refresh upgrade cache after backend ops

UpgradeChecker caches the last upgrade-check result and only
refreshes on the 6-hour tick or after an auto-upgrade cycle.
Manual upgrades (POST /api/backends/upgrade/:name) go through
the async galleryop worker, which completes the upgrade
correctly but never tells UpgradeChecker to re-check -- so
/api/backends/upgrades continued to list a just-upgraded backend
as upgradeable, indistinguishable from a failed upgrade, for up
to six hours.

Add an optional `OnBackendOpCompleted func()` hook on
GalleryService that fires after every successful install /
upgrade / delete on the backend channel (async, so a slow
callback doesn't stall the queue). startup.go wires it to
UpgradeChecker.TriggerCheck after both services exist. Result:
the upgrade banner clears within milliseconds of the worker
finishing.

Assisted-by: Claude:Opus 4.7

* build: prepend GOPATH/bin to PATH for protogen-go

install-go-tools runs `go install` for protoc-gen-go and
protoc-gen-go-grpc, which writes them into `go env GOPATH`/bin.
That directory isn't on every dev's PATH, and protoc resolves
its code-gen plugins via PATH, so the immediately-following
protoc invocation fails with
  "protoc-gen-go: program not found"
which in turn blocks `make build` and any
`make backends/%` target that depends on build.

Prepend `go env GOPATH`/bin to PATH for the protoc invocation
so the freshly-installed plugins are found without requiring a
shell-profile change.

Assisted-by: Claude:Opus 4.7

* refactor(ui-api): non-blocking backend upgrade handler with opcache

POST /api/backends/upgrade/:name used to send the ManagementOp
directly onto the unbuffered BackendGalleryChannel, which blocked
the HTTP request whenever the galleryop worker was busy with a
prior operation. The op also didn't show up in /api/operations,
so the Backends UI couldn't reflect upgrade progress on the
affected row.

Register the op in opcache immediately, wrap it in a cancellable
context, store the cancellation function on the GalleryService,
and push onto the channel from a goroutine so the handler
returns right away. Response gains a `jobID` field and a
`message` string so clients have a consistent handle regardless
of whether the op is queued or running.

Pairs with the OnBackendOpCompleted hook added in the galleryop
commit — together the UI sees the upgrade start, watches
progress via /api/operations, and drops the "upgradeable" flag
the moment the worker finishes.

Assisted-by: Claude:Opus 4.7
2026-04-24 08:50:34 +02:00
LocalAI [bot]
c755cd5ab5 feat(swagger): update swagger (#9518)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-23 23:26:50 +02:00
LocalAI [bot]
0fb04f7ac3 chore(model-gallery): ⬆️ update checksum (#9522)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-23 23:26:27 +02:00
Ettore Di Giacinto
d9d7b5c29b docs(readme): add April 2026 highlights to Latest News
Assisted-by: Claude-Code:claude-opus-4-7
2026-04-23 20:47:06 +00:00
walcz-de
f877942d97 fix(openresponses): parse OpenAI-spec nested tool_choice + use correct setter (#9509)
Two bugs in MergeOpenResponsesConfig (/v1/responses + WebSocket, *not*
/v1/chat/completions — that has a separate, working path via Tool
unmarshal + SetFunctionCallNameString):

1. **Shape mismatch.** OpenAI's specific-function tool_choice nests the
   name under "function":
       {"type": "function", "function": {"name": "my_function"}}
   The legacy flat shape was:
       {"type": "function", "name": "my_function"}
   Only the flat shape was handled. OpenAI-compliant clients that reach
   /v1/responses (openai-python with the Responses API, Stainless-generated
   SDKs, …) silently failed to force the function.

2. **Wrong setter.** The code called SetFunctionCallString(name), which
   writes the mode field (functionCallString: "none"/"auto"/"required").
   The specific-function name lives in a separate field
   (functionCallNameString), read by ShouldCallSpecificFunction and
   FunctionToCall. Net effect: a correctly-formed tool_choice never
   engaged grammar-based forcing.

The fix preserves backward compatibility by accepting both shapes
(nested preferred, flat as fallback) and routes to the correct setter.

Note: The same "wrong setter" pattern appears at three other sites —
anthropic/messages.go:883, openai/realtime_model.go:171, and
openresponses/responses.go:776 — and /v1/chat/completions has its own
issue parsing tool_choice="required" as a string (json.Unmarshal on a
raw string fails silently). Those are filed as a tracking issue rather
than bundled here to keep this PR focused.

## Test plan
9 new Ginkgo specs under "MergeOpenResponsesConfig tool_choice parsing":
  - string modes: "required" / "auto" / "none"
  - OpenAI-spec nested shape: {type:function, function:{name}}
  - Legacy Anthropic-compat flat shape: {type:function, name}
  - Shape-preference: nested wins over flat when both present
  - Malformed: missing type, wrong type, missing name, empty name, nil

$ go test ./core/http/middleware/ -count=1 -run TestMiddleware
  Ran 28 of 28 Specs in 0.003 seconds -- PASS

## Repro (against /v1/responses)

    curl -N http://localai/v1/responses \
         -H 'Content-Type: application/json' \
         -d '{"model":"qwen3.6-35b-a3b-apex",
              "input":"Weather in Berlin?",
              "tools":[{"type":"function","name":"get_weather",
                        "parameters":{"type":"object",
                          "properties":{"city":{"type":"string"}},
                          "required":["city"]}}],
              "tool_choice":{"type":"function",
                             "function":{"name":"get_weather"}}}'

Before: grammar-based forcing silently inactive; model free-texts.
After : grammar forces get_weather invocation; output contains
        tool_calls with function:{name:"get_weather", arguments:{...}}.
2026-04-23 18:30:05 +02:00
Ettore Di Giacinto
f5eb13d3c2 feat(insightface): add antispoofing (liveness) detection (#9515)
* feat(insightface): add antispoofing (liveness) detection

Light up the anti_spoofing flag that was parked during the first pass.
Both FaceVerify and FaceAnalyze now run the Silent-Face MiniFASNetV2 +
MiniFASNetV1SE ensemble (~4 MB, Apache 2.0, CPU <10ms) when the flag is
set. Failed liveness on either image vetoes FaceVerify regardless of
embedding similarity. Every insightface* gallery entry now ships the
MiniFASNet ONNX weights so existing packs light up after reinstall.

Setting the flag against a model without the MiniFASNet files returns
FAILED_PRECONDITION (HTTP 412) with a clear install message — no
silent is_real=false.

FaceVerifyResponse gained per-image img{1,2}_is_real and
img{1,2}_antispoof_score (proto 9-12); FaceAnalysis's existing
is_real/antispoof_score fields are now populated. Schema fields are
pointers so they are fully absent from the JSON response when
anti_spoofing was not requested — avoids collapsing "not checked" with
"checked and fake" under Go's omitempty on bool.

Validated end-to-end over HTTP against a local install:
- verify + anti_spoofing, both real -> verified=true, score ~0.76
- verify + anti_spoofing, img2 spoof -> verified=false, img2_is_real=false
- analyze + anti_spoofing -> is_real and score per face
- flag against model without MiniFASNet -> HTTP 412 fail-loud

Assisted-by: Claude:claude-opus-4-7 go vet

* test(insightface): wire test target into test-extra

The root Makefile's `test-extra` already runs
`$(MAKE) -C backend/python/insightface test`, but the backend's
Makefile never defined the target — so the command silently errored
and the suite was never executed in CI. Adding the two-line target
(matching ace-step/Makefile) hooks `test.sh` → `runUnittests` →
`python -m unittest test.py`, which discovers both the pre-existing
engine classes (InsightFaceEngineTest, OnnxDirectEngineTest) and the
new AntispoofingTest. Each class skips gracefully when its weights
can't be downloaded from a network-restricted runner.

Assisted-by: Claude:claude-opus-4-7

* test(insightface): exercise antispoofing in e2e-backends (both paths)

Add a `face_antispoof` capability to the Ginkgo e2e suite and extend
the existing FaceVerify + FaceAnalyze specs with liveness assertions
covering BOTH paths:

  real fixture -> is_real=true, score>0, verified stays true
  spoof fixture -> is_real=false, verified vetoed to false

The spoof fixture is upstream's own `image_F2.jpg` (via the yakhyo
mirror) — verified locally against the MiniFASNetV2+V1SE ensemble to
classify as is_real=false with score ~0.013. That makes the assertion
deterministic across CI runs; synthetic/derived spoofs fool the model
unpredictably and would be flaky.

Makefile wires it up end-to-end:
- New INSIGHTFACE_ANTISPOOF_* cache dir + two ONNX downloads with
  pinned SHAs, matching the gallery entries.
- insightface-antispoof-models target shared by both backend configs.
- FACE_SPOOF_IMAGE_URL passed via BACKEND_TEST_FACE_SPOOF_IMAGE_URL.
- Both e2e targets (buffalo-sc + opencv) now:
  * depend on insightface-antispoof-models
  * pass antispoof_v2_onnx / antispoof_v1se_onnx in BACKEND_TEST_OPTIONS
  * include face_antispoof in BACKEND_TEST_CAPS

backend_test.go adds the new capability constant and a faceSpoofFile
fixture resolved the same way as faceFile1/2/3. Spoof assertions are
gated on both capFaceAntispoof AND faceSpoofFile being set, so a test
config that omits the spoof fixture degrades gracefully to "real path
only" instead of failing.

Assisted-by: Claude:claude-opus-4-7 go vet
2026-04-23 18:28:15 +02:00
Ettore Di Giacinto
c1f923b2bc fix(importer): emit all shards for multi-part GGUF models (#9513)
The llama-cpp HuggingFace importer iterated files one at a time and
kept overwriting `lastGGUFFile`, so sharded repos such as
`unsloth/Kimi-K2.6-GGUF` (14 `Q8_K_XL` parts) produced a gallery entry
pointing only at the final shard — useless to llama.cpp's split loader,
which needs shard 1 to discover the set.

Group shards up front via new helpers in `pkg/huggingface-api`
(`SplitShardSuffix`, `ShardGroup`, `GroupShards`). The llama-cpp
importer now picks a group (preferred quant, then last-group fallback)
and emits every shard, with `Model:` pointing at shard 1.
`FindPreferredModelFile` returns shard 1 of the first matching group so
the gallery agent's preview stays coherent for sharded repos.

Adds unit coverage for the HuggingFace branch of the importer (which
had none), plus shard-detection tests in the hfapi package.

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash]
2026-04-23 15:00:02 +02:00
Ettore Di Giacinto
ed648b3b4e fix(llama-cpp): include server-chat.cpp in grpc-server translation unit (#9511)
* fix(llama-cpp): include server-chat.cpp in grpc-server translation unit

Upstream llama.cpp refactor (ggml-org/llama.cpp#20690) moved the
OAI/Anthropic/Responses and transcription conversion helpers out of
server-common.cpp into a new server-chat.cpp, and server-task.cpp and
server-context.cpp now call those symbols (convert_transcriptions_to_chatcmpl,
server_chat_convert_responses_to_chatcmpl, server_chat_convert_anthropic_to_oai,
server_chat_msg_diff_to_json_oaicompat) via server-chat.h.

grpc-server.cpp builds as a single translation unit by #include-ing the
upstream .cpp files directly. Without including server-chat.cpp, the
declarations are satisfied at compile time via server-chat.h but the
link step fails with undefined references once LLAMA_VERSION crosses
the refactor commit (134d6e54).

Guard the include with __has_include so the same source stays buildable
on older LLAMA_VERSION pins that predate the refactor (where prepare.sh
won't copy server-chat.cpp into tools/grpc-server/).

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(llama-cpp): bump LLAMA_VERSION to 0d0764dfd

Bump to ggml-org/llama.cpp@0d0764dfd2.
Paired with the preceding grpc-server server-chat.cpp include so the
refactor at 134d6e54 links cleanly. Supersedes PR #9494.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-23 14:59:39 +02:00
Ettore Di Giacinto
3ce5248126 Update expected length of instructions in test
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-23 14:58:57 +02:00
Ettore Di Giacinto
04f1a0285d fix(ik-llama-cpp): adapt to common_grammar struct in sampling.h (#9512)
Upstream ik_llama.cpp commit e0596bf6 ("Autoparser") changed
common_params_sampling::grammar from std::string to a common_grammar
struct (type + grammar), which broke our two direct accesses:

 - JSON ingest fed the field through json_value<common_grammar>(...),
   for which nlohmann has no from_json adapter.
 - JSON export emitted the struct directly, for which nlohmann has no
   to_json adapter.

Wrap the incoming JSON string in common_grammar{COMMON_GRAMMAR_TYPE_USER, ...}
and serialize via the inner .grammar member, mirroring upstream's
examples/server/server-context.cpp.

Also bump IK_LLAMA_VERSION to 286ce324baed17c95faec77792eaa6bdb1c7a5f5
so the local-ai side lines up with the dependency bump in #9496.

Assisted-by: Claude-Code:claude-opus-4-7
2026-04-23 13:45:06 +02:00
Ettore Di Giacinto
181ebb6df4 feat: voice recognition (#9500)
* feat(voice-recognition): add /v1/voice/{verify,analyze,embed} + speaker-recognition backend

Audio analog to face recognition. Adds three gRPC RPCs
(VoiceVerify / VoiceAnalyze / VoiceEmbed), their Go service and HTTP
layers, a new FLAG_SPEAKER_RECOGNITION capability flag, and a Python
backend scaffold under backend/python/speaker-recognition/ wrapping
SpeechBrain ECAPA-TDNN with a parallel OnnxDirectEngine for
WeSpeaker / 3D-Speaker ONNX exports.

The kokoros Rust backend gets matching unimplemented trait stubs —
tonic's async_trait has no defaults, so adding an RPC without Rust
stubs breaks the build (same regression fixed by eb01c772 for face).

Swagger, /api/instructions, and the auth RouteFeatureRegistry /
APIFeatures list are updated so the endpoints surface everywhere a
client or admin UI looks.

Assisted-by: Claude:claude-opus-4-7

* feat(voice-recognition): add 1:N identify + register/forget endpoints

Mirrors the face-recognition register/identify/forget surface. New
package core/services/voicerecognition/ carries a Registry interface
and a local-store-backed implementation (same in-memory vector-store
plumbing facerecognition uses, separate instance so the embedding
spaces stay isolated).

Handlers under /v1/voice/{register,identify,forget} reuse
backend.VoiceEmbed to compute the probe vector, then delegate the
nearest-neighbour search to the registry. Default cosine-distance
threshold is tuned for ECAPA-TDNN on VoxCeleb (0.25, EER ~1.9%).

As with the face registry, the current backing is in-memory only — a
pgvector implementation is a future constructor-level swap.

Assisted-by: Claude:claude-opus-4-7

* feat(voice-recognition): gallery, docs, CI and e2e coverage

- backend/index.yaml: speaker-recognition backend entry + CPU and
  CUDA-12 image variants (plus matching development variants).
- gallery/index.yaml: speechbrain-ecapa-tdnn (default) and
  wespeaker-resnet34 model entries. The WeSpeaker SHA-256 is a
  deliberate placeholder — the HF URI must be curl'd and its hash
  filled in before the entry installs.
- docs/content/features/voice-recognition.md: API reference + quickstart,
  mirrors the face-recognition docs.
- React UI: CAP_SPEAKER_RECOGNITION flag export (consumers follow face's
  precedent — no dedicated tab yet).
- tests/e2e-backends: voice_embed / voice_verify / voice_analyze specs.
  Helper resolveFaceFixture is reused as-is — the only thing face/voice
  share is "download a file into workDir", so no need for a new helper.
- Makefile: docker-build-speaker-recognition + test-extra-backend-
  speaker-recognition-{ecapa,all} targets. Audio fixtures default to
  VCTK p225/p226 samples from HuggingFace.
- CI: test-extra.yml grows a tests-speaker-recognition-grpc job
  mirroring insightface. backend.yml matrix gains CPU + CUDA-12 image
  build entries — scripts/changed-backends.js auto-picks these up.

Assisted-by: Claude:claude-opus-4-7

* feat(voice-recognition): wire a working /v1/voice/analyze head

Adds AnalysisHead: a lazy-loading age / gender / emotion inference
wrapper that plugs into both SpeechBrainEngine and OnnxDirectEngine.

Defaults to two open-licence HuggingFace checkpoints:
  - audeering/wav2vec2-large-robust-24-ft-age-gender (Apache 2.0) —
    age regression + 3-way gender (female / male / child).
  - superb/wav2vec2-base-superb-er (Apache 2.0) — 4-way emotion.

Both are optional and degrade gracefully when transformers or the
model can't be loaded — the engine raises NotImplementedError so the
gRPC layer returns 501 instead of a generic 500.

Emotion classes pass through from the model (neutral/happy/angry/sad
on the default checkpoint); the e2e test now accepts any non-empty
dominant gender so custom age_gender_model overrides don't fail it.

Adds transformers to the backend's CPU and CUDA-12 requirements.

Assisted-by: Claude:claude-opus-4-7

* fix(voice-recognition): pin real WeSpeaker ResNet34 ONNX SHA-256

Replaces the placeholder hash in gallery/index.yaml with the actual
SHA-256 (7bb2f06e…) of the upstream
Wespeaker/wespeaker-voxceleb-resnet34-LM ONNX at ~25MB. `local-ai
models install wespeaker-resnet34` now succeeds.

Assisted-by: Claude:claude-opus-4-7

* fix(voice-recognition): soundfile loader + honest analyze default

Two issues surfaced on first end-to-end smoke with the actual backend
image:

1. torchaudio.load in torchaudio 2.8+ requires the torchcodec package
   for audio decoding. Switch SpeechBrainEngine._load_waveform to the
   already-present soundfile (listed in requirements.txt) plus a numpy
   linear resample to 16kHz. Drops a heavy ffmpeg-linked dep and the
   codepath we never exercise (torchaudio's ffmpeg backend).

2. The AnalysisHead was defaulting to audeering/wav2vec2-large-robust-
   24-ft-age-gender, but AutoModelForAudioClassification silently
   mangles that checkpoint — it reports the age head weights as
   UNEXPECTED and re-initialises the classifier head with random
   values, so the "gender" output is noise and there is no age output
   at all. Make age/gender opt-in instead (empty default; users wire
   a cleanly-loadable Wav2Vec2ForSequenceClassification checkpoint via
   age_gender_model: option). Emotion keeps its working Superb default.
   Also broaden _infer_age_gender's tensor-shape handling and catch
   runtime exceptions so a dodgy age/gender head never takes down the
   whole analyze call.

Docs and README updated to match the new policy.

Verified with the branch-scoped gallery on localhost:
- voice/embed    → 192-d ECAPA-TDNN vector
- voice/verify   → same-clip dist≈6e-08 verified=true; cross-speaker
                   dist 0.76–0.99 verified=false (as expected)
- voice/register/identify/forget → round-trip works, 404 on unknown id
- voice/analyze  → emotion populated, age/gender omitted (opt-in)

Assisted-by: Claude:claude-opus-4-7

* fix(voice-recognition): real CI audio fixtures + fixture-agnostic verify spec

Two issues surfaced after CI actually ran the speaker-recognition e2e
target (I'd curl-tested against a running server but hadn't run the
make target locally):

1. The default BACKEND_TEST_VOICE_AUDIO_* URLs pointed at
   huggingface.co/datasets/CSTR-Edinburgh/vctk paths that return 404
   (the dataset is gated). Swap them for the speechbrain test samples
   served from github.com/speechbrain/speechbrain/raw/develop/ —
   public, no auth, correct 16kHz mono format.

2. The VoiceVerify spec required d(file1,file2) < 0.4, assuming
   file1/file2 were same-speaker. The speechbrain samples are three
   different speakers (example1/2/5), and there is no easy un-gated
   source of true same-speaker audio pairs (VoxCeleb/VCTK/LibriSpeech
   are all license- or size-gated for CI use). Replace the ceiling
   check with a relative-ordering assertion: d(pair) > d(same-clip)
   for both file2 and file3 — that's enough to prove the embeddings
   encode speaker info, and it works with any three non-identical
   clips. Actual speaker ordering d(1,2) vs d(1,3) is logged but not
   asserted.

Local run: 4/4 voice specs pass (Health, LoadModel, VoiceEmbed,
VoiceVerify) on the built backend image. 12 non-voice specs skipped
as expected.

Assisted-by: Claude:claude-opus-4-7

* fix(ci): checkout with submodules in the reusable backend_build workflow

The kokoros Rust backend build fails with

    failed to read .../sources/Kokoros/kokoros/Cargo.toml: No such file

because the reusable backend_build.yml workflow's actions/checkout
step was missing `submodules: true`. Dockerfile.rust does `COPY .
/LocalAI`, and without the submodule files the subsequent `cargo
build` can't find the vendored Kokoros crate.

The bug pre-dates this PR — scripts/changed-backends.js only triggers
the kokoros image job when something under backend/rust/kokoros or
the shared proto changes, so master had been coasting past it. The
voice-recognition proto addition re-broke it.

Other checkouts in backend.yml (llama-cpp-darwin) and test-extra.yml
(insightface, kokoros, speaker-recognition) already pass
`submodules: true`; this brings the shared backend image builder in
line.

Assisted-by: Claude:claude-opus-4-7
2026-04-23 12:07:14 +02:00
LocalAI [bot]
1c59165d63 chore(model gallery): 🤖 add 1 new models via gallery agent (#9505)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-23 09:32:44 +02:00
LocalAI [bot]
eb00d9b178 chore: ⬆️ Update leejet/stable-diffusion.cpp to c97702e1057c2fe13a7074cd9069cb9dd6edc1bf (#9495)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-23 09:32:21 +02:00
LocalAI [bot]
2068b6f43c feat(swagger): update swagger (#9498)
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-22 22:51:39 +02:00
Ettore Di Giacinto
eb01c77214 fix(kokoros): implement face_verify and face_analyze trait stubs (#9499)
The backend.proto was updated to add FaceVerify and FaceAnalyze RPCs
(face detection support), but the Rust KokorosService was never updated
to match the regenerated tonic trait, breaking compilation with E0046:

    not all trait items implemented, missing: `face_verify`, `face_analyze`

Stubs both methods as unimplemented, matching the pattern used for the
other RPCs Kokoros does not support.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]
2026-04-22 22:51:18 +02:00
Richard Palethorpe
bb4fda6f0e chore(agents): Update the backend creation instructions to include Rust and extra tests (#9490)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-22 22:43:01 +02:00
Ettore Di Giacinto
f0c92610a1 feat(importer): expand importer flow to almost all backends (#9466)
* docs(agents): require importer integration when adding backends

Document the importer registry workflow so contributors know that adding
a new backend also requires updating the /import-model dropdown source:
either a new importer in core/gallery/importers/, extending an existing
one for drop-in replacements, or the pref-only slice for backends with
no reliable auto-detect signal. Always covered by a table-driven test.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for Batch 0 primitives

Introduce failing tests that drive Batch 0 of the importer expansion:

- pkg/huggingface-api: assert GetModelDetails populates PipelineTag and
  LibraryName from /api/models/{repo}, and that a failing metadata
  endpoint still returns file details (best-effort fetch).
- core/gallery/importers/helpers_test.go: new table-driven coverage for
  HasFile, HasExtension, HasONNX, HasONNXConfigPair, HasGGMLFile.
- core/gallery/importers/importers_test.go: assert ErrAmbiguousImport
  sentinel exists and round-trips through errors.Is.
- core/gallery/importers/local_test.go: extend with detection cases for
  ggml-*.bin (whisper), silero_vad.onnx (silero-vad), and the piper
  .onnx + .onnx.json pair.
- core/http/endpoints/localai/import_model_test.go: assert
  ImportModelURIEndpoint returns HTTP 400 with a structured
  {error, detail, hint} body when ErrAmbiguousImport surfaces.

All tests fail in the expected places (missing fields, missing
helpers, missing sentinel, endpoint still wraps as 500).

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): Batch 0 foundation — helpers, sentinel, local detection

Implements the Batch 0 primitives that subsequent importer batches build on:

- pkg/huggingface-api: ModelDetails gains PipelineTag and LibraryName.
  GetModelDetails now layers a best-effort GET /api/models/{repo} fetch
  on top of ListFiles — a metadata outage leaves the fields empty but
  still returns full file details. Uses a dedicated response struct
  because the single-model endpoint uses snake_case keys while the list
  endpoint historically returned camelCase.

- core/gallery/importers/helpers.go: generic HasFile, HasExtension,
  HasONNX, HasONNXConfigPair, HasGGMLFile helpers working on
  []hfapi.ModelFile so per-backend importers can detect artefact
  patterns without duplicating string wrangling.

- core/gallery/importers/importers.go: adds the ErrAmbiguousImport
  sentinel. DiscoverModelConfig now returns it (wrapped with
  fmt.Errorf("%w: ...")) when no importer matched AND the HF
  pipeline_tag falls in a whitelist of narrow modalities (ASR, TTS,
  sentence-similarity, text-classification, object-detection). The
  whitelist is intentionally narrow — unknown tags keep the previous
  "no importer matched" behaviour to avoid blocking rare repos.

- core/gallery/importers/local.go: three new local-path detections,
  inserted before the existing merged-transformers branch:
    * ggml-*.bin → whisper
    * silero*.onnx → silero-vad
    * *.onnx + *.onnx.json pair → piper

- core/http/endpoints/localai/import_model.go: ImportModelURIEndpoint
  surfaces ErrAmbiguousImport as HTTP 400 with
  {error, detail, hint} JSON, preserving existing behaviour for
  unrelated errors.

Green tests:
  go test ./core/gallery/importers/... ./pkg/huggingface-api/... \
          ./core/http/endpoints/localai/...

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(importers): red tests for KnownBackend endpoint and importer metadata

Add failing tests that drive Batch UI-Dropdown:

- importers_test.go: assert importers expose Name/Modality/AutoDetects
  and that LlamaCPPImporter advertises drop-in replacements via a new
  AdditionalBackendsProvider interface. A Registry() accessor is also
  expected.

- backend_test.go (new): assert GET /backends/known returns
  []schema.KnownBackend, covers every importer, exposes drop-in
  llama-cpp replacements, includes curated pref-only backends, has no
  duplicates, and is sorted by Modality+Name.

These tests fail at compile time against master; they are intentionally
red so the follow-up green commit is reviewable.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery): add /backends/known endpoint for importer-aware backend list

Extend the Importer interface with Name/Modality/AutoDetects so the
import system can self-describe its registry, and introduce the
AdditionalBackendsProvider interface so importers can advertise drop-in
replacements (llama-cpp advertises ik-llama-cpp and turboquant).

Expose the new GET /backends/known endpoint that merges:

- the importer registry (auto-detect supported),
- drop-in replacements hosted by importers (preference-only),
- a curated knownPrefOnlyBackends slice for backends with no dedicated
  importer (sglang, tinygrad, trl, mlx-vlm, whisperx, kokoros, Qwen TTS
  variants, sam3-cpp) — kept at the top of backend.go so contributors
  adding a new pref-only backend have one obvious place to edit,
- backends installed on disk but unknown to the importer (marked
  AutoDetect=false, empty Modality).

The endpoint deliberately does NOT filter by gallery membership or host
capability (unlike /backends/available): LocalAI may auto-install a
backend that is not yet present, so the import form dropdown must show
everything the importer knows about.

Response is deduplicated (importer wins over pref-only) and sorted by
Modality+Name for deterministic output.

Registered in core/http/routes/localai.go next to /backends/available
under the same admin middleware.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(ui): source import form backend dropdown from /backends/known

Replace the hard-coded BACKENDS constant in ImportModel.jsx with a
live fetch of /backends/known on mount. Users now see every backend
the importer layer knows about (including preference-only entries)
grouped by modality, not a stale subset.

Changes:

- config.js: add backendsKnown endpoint constant next to
  backendsAvailable.
- api.js: add backendsApi.listKnown() wrapper.
- ImportModel.jsx: remove BACKENDS constant, fetch the list via
  useEffect, and derive grouped options via buildBackendOptions.
  Preference-only entries render with a " (preference-only)" suffix.
  Loading state disables the dropdown with a "Loading backends…"
  placeholder; on fetch failure the form falls back to auto-detect
  only and surfaces a non-blocking toast.
- SearchableSelect.jsx: accept items flagged isHeader=true and render
  them as non-selectable section dividers. Keyboard navigation skips
  headers and search queries hide them so filtered output stays
  relevant.

Vitest is not set up in this project (devDependencies ship Playwright
only). Per the brief's guard-rail, no frontend test framework is
introduced; coverage is provided by the Go handler tests that assert
the /backends/known contract consumed by the React form.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for whisper importer

Asserts detection on ggerganov/whisper.cpp (via ggml-*.bin filename),
the preferences.backend=whisper override path for arbitrary URIs,
and the Importer interface metadata (name/modality/autodetect).

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add whisper importer

Recognises whisper.cpp GGML models by the "ggml-*.bin" filename
convention (direct URL or HF repo member) and by the explicit
preferences.backend="whisper" override. Emits backend: whisper with
the transcript use-case. Registered before llama-cpp so the narrow
filename signal wins before any generic GGUF match is attempted.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for moonshine importer

Asserts detection on UsefulSensors/moonshine-tiny via owner + ONNX
files, the preferences.backend=moonshine override for arbitrary URIs,
and the Importer interface metadata (name/modality/autodetect).

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add moonshine importer

Matches UsefulSensors-owned HF repos whose artefacts or metadata
identify them as ASR: on-disk .onnx files (the canonical Moonshine
packaging) OR pipeline_tag=automatic-speech-recognition (covers
transformers/safetensors-only sibling repos). preferences.backend=
moonshine overrides detection. Test uses the live moonshine-tiny
repo because the canonical UsefulSensors/moonshine repo currently
hits a recursive-subfolder bug in pkg/huggingface-api ListFiles.

Registered after WhisperImporter but before LlamaCPPImporter and
TransformersImporter so the narrower owner+ASR signal wins before
the generic tokenizer.json check routes the repo to transformers.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for nemo importer

Asserts detection on nvidia/parakeet-tdt-0.6b-v3 via owner + .nemo
file, the preferences.backend=nemo override for arbitrary URIs, and
the Importer interface metadata (name/modality/autodetect).

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add nemo importer

Matches nvidia-owned HF repos that ship a .nemo checkpoint archive,
the canonical NeMo ASR packaging. preferences.backend=nemo forces
detection. Registered between moonshine and llama-cpp so the narrow
owner + extension signal wins before any downstream generic matcher.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for faster-whisper importer

Asserts detection on Systran/faster-whisper-large-v3 (owner +
model.bin + config.json + ASR pipeline), the preferences.backend=
faster-whisper override for arbitrary URIs, and the Importer
interface metadata.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add faster-whisper importer

Recognises CTranslate2-packaged whisper checkpoints distributed for
the faster-whisper runtime: model.bin + config.json + ASR
pipeline_tag, narrowed to Systran-owned repos or repo names
containing "faster-whisper" to avoid falsely claiming vanilla
OpenAI whisper HF repos. preferences.backend=faster-whisper
overrides detection. Registered before llama-cpp and transformers
so the narrow signal wins before tokenizer.json routes the repo to
the generic transformers importer.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for qwen-asr importer

Asserts detection on Qwen/Qwen3-ASR-1.7B via owner + ASR substring
in the repo name, the preferences.backend=qwen-asr override for
arbitrary URIs, and the Importer interface metadata.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add qwen-asr importer

Matches Qwen-owned HF repos whose name contains "ASR"
(case-insensitive), routing them to the qwen-asr backend rather
than the generic transformers/vllm path. The substring check scans
the repo portion only so the owner field cannot leak a false match.
preferences.backend=qwen-asr forces detection. Registered before
llama-cpp and transformers so the narrow owner+name signal wins.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): ASR ambiguity surfaces ErrAmbiguousImport

Locks in the behaviour added in Batch 0: an HF repo whose pipeline_tag
marks it as automatic-speech-recognition but whose artefacts match no
ASR importer (and no generic importer) must fail with
ErrAmbiguousImport so callers know to pass preferences.backend rather
than silently guess. pyannote/voice-activity-detection is the fixture
— its file list is only config.yaml + README, leaving every importer's
artefact check negative.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for piper importer

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add piper importer

Detects piper TTS voices by the canonical <voice>.onnx + <voice>.onnx.json
pair packaging (via HasONNXConfigPair). Narrow enough to skip generic
ONNX repos used by other backends (Moonshine ASR, sentence-transformers).

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for bark importer

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add bark importer

Detects Suno's Bark TTS checkpoints by HF owner "suno" + repo name
prefix "bark". Adds HFOwnerRepoFromURI() helper so importers can fall
back to URI parsing when pkg/huggingface-api's recursive tree listing
errors on repos with nested subdirectories (suno/bark ships a
speaker_embeddings/v2 subtree that trips a pre-existing path-doubling
bug in the listFilesInPath recursion).

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for fish-speech importer

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add fish-speech importer

Detects Fish Audio TTS releases by HF owner "fishaudio" with a URI-based
fallback for repos whose tree recursion trips the pre-existing hfapi
path-doubling bug.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for outetts importer

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add outetts importer

Detects OuteAI's OuteTTS releases by HF owner "OuteAI" or a case-
insensitive "OuteTTS" substring in the repo name, with a URI-based
fallback for recursion-bugged repos.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for voxcpm importer

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add voxcpm importer

Detects OpenBMB's VoxCPM TTS family by repo-name substring (community
mirrors re-host the weights under many owners — mlx-community,
bluryar, callgg, etc).

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for kokoro importer

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add kokoro importer

Detects hexgrad's Kokoro TTS by the "Kokoro" repo-name substring paired
with a PyTorch .pth/.pt checkpoint — the pairing excludes ONNX-only
mirrors (handled by the pref-only `kokoros` Rust runtime) and GGUF
mirrors (handled by llama-cpp).

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for kitten-tts importer

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add kitten-tts importer

Detects KittenML's kitten-tts releases by owner or "kitten-tts" repo-name
substring, with URI-parsing fallback.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for neutts importer

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add neutts importer

Detects Neuphonic's NeuTTS releases by owner "neuphonic" or "neutts"
repo-name substring, with URI-parsing fallback.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for chatterbox importer

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add chatterbox importer

Detects Resemble AI's Chatterbox TTS by owner "ResembleAI" or
"chatterbox" repo-name substring, with URI-parsing fallback.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for vibevoice importer

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add vibevoice importer

Detects Microsoft's VibeVoice TTS by "vibevoice" repo-name substring
(case-insensitive) so community mirrors still route here.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for coqui importer

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add coqui importer

Detects Coqui AI's TTS releases (XTTS-v2, YourTTS, …) by the
authoritative `coqui` HF owner, with URI-parsing fallback.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): TTS ambiguity surfaces ErrAmbiguousImport

Adds a Ginkgo spec that imports nari-labs/Dia-1.6B — a real HF repo
carrying pipeline_tag="text-to-speech" whose artefacts (*.pth, one
safetensors shard, preprocessor_config.json, config.json) match none of
the Batch-2 TTS importers nor the generic text/image importers — and
asserts DiscoverModelConfig wraps ErrAmbiguousImport via errors.Is.

Also pivots the endpoint-level ambiguity fixture from hexgrad/Kokoro-82M
to nari-labs/Dia-1.6B. Batch 2 added a dedicated kokoro importer that
now claims the original fixture; Dia remains genuinely unclaimed and
so exercises the same ambiguity code path at the HTTP layer.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for stablediffusion-ggml importer

Covers HF repo detection (city96/FLUX.1-dev-gguf), raw .gguf URL matching on
filename arch tokens, preference override, and Importer interface metadata.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add stablediffusion-ggml importer

Detects GGUF-packed Stable Diffusion and FLUX checkpoints (leejet owner,
city96 FLUX mirrors, second-state SD dumps, raw .gguf URLs with arch
tokens) and routes them to the stablediffusion-ggml backend. Registered
BEFORE LlamaCPPImporter so .gguf image checkpoints are not stolen by
llama-cpp's generic .gguf match. Reuses HFOwnerRepoFromURI for the
hfapi-recursion-bug fallback. preferences.backend overrides detection.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for ace-step importer

Covers HF repo-name detection (ACE-Step/ACE-Step-v1-3.5B), preference
override, and Importer interface metadata.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add ace-step importer

Routes ACE-Step music generation checkpoints (ACE-Step/ACE-Step-v1-3.5B,
ACE-Step/Ace-Step1.5, community mirrors) to the ace-step backend.
Matching is case-insensitive on the "ace-step" repo-name substring and
owner, with an HFOwnerRepoFromURI fallback for the hfapi recursion bug.
KnownUsecaseStrings mirrors the gallery's ace-step-turbo entry
(sound_generation, tts). preferences.backend overrides.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): surface ErrAmbiguousImport on text-to-image misses

Adds text-to-image to ambiguousModalities whitelist and covers the
h94/IP-Adapter-FaceID case — pipeline_tag=text-to-image but ships only
.bin/.safetensors so diffusers, stablediffusion-ggml, llama-cpp,
transformers, vllm, mlx, and ace-step all miss. DiscoverModelConfig now
surfaces ErrAmbiguousImport for that shape instead of the opaque
"no importer matched" error.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for vllm-omni importer

Introduces the test surface for the forthcoming VLLMOmniImporter:
detection via preferences.backend, Qwen owner + Omni repo token,
URI-only fallback, negative cases (plain Qwen, random OmniX repo), and
Import() emitting backend: vllm-omni with chat + multimodal usecases.

Includes a registration-order assertion via DiscoverModelConfig to pin
the requirement that vllm-omni wins over vllm for Qwen Omni repos
(tokenizer files are usually present too).

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add vllm-omni importer

Adds VLLMOmniImporter for Qwen Omni-style multimodal checkpoints
(Qwen3-Omni, Qwen2.5-Omni, …). Detection is narrow: HF owner "Qwen"
combined with "omni" in the repo name, or a repo name matching the
-Omni-/Omni- naming pattern. preferences.backend="vllm-omni" always
wins; HFOwnerRepoFromURI provides a URI-only fallback for the hfapi
recursion-bug edge case.

Emitted YAML sets backend: vllm-omni and known_usecases: [chat,
multimodal], matching the gallery/index.yaml vllm-omni entries. The
importer is registered ahead of VLLMImporter so Qwen Omni repos —
which also carry tokenizer files — route to vllm-omni rather than the
plain vllm backend.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for llama-cpp drop-in preferences

Pins the expected drop-in replacement behaviour: preferences.backend
of ik-llama-cpp or turboquant must swap the emitted YAML backend
field while keeping the llama-cpp file layout identical. Also covers
the unknown-backend case (must stay llama-cpp) and re-asserts
AdditionalBackends() returns the two curated entries with non-empty
descriptions.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): llama-cpp honours ik-llama-cpp and turboquant drop-in preferences

preferences.backend set to ik-llama-cpp or turboquant now swaps the
emitted YAML backend field while leaving the file layout, model path,
mmproj handling and everything else in the llama-cpp Import pipeline
untouched. Unknown values are ignored and fall back to backend:
llama-cpp so arbitrary input can't leak into the config.

Aligns the AdditionalBackends() descriptions with the user-facing
naming conventions surfaced via /backends/known. No changes to the
pref-only curated list in endpoints/localai/backend.go: the two
drop-in names have always lived on the importer side via
AdditionalBackends.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for silero-vad importer

Add the SileroVADImporter test fixtures covering metadata, preference
overrides, snakers4 + onnx detection, silero_vad.onnx canonical filename,
URI fallback, and live HF discovery. Implementation follows in the next
commit.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add silero-vad importer

Recognise the Silero VAD ONNX packaging: the canonical silero_vad.onnx
filename or any ONNX file under the snakers4 owner. Emits a
backend: silero-vad config with the vad known_usecase, and attaches the
canonical file entry when present so the weights download on import.

Registered before the generic importers so the unique-filename signal
takes precedence over any downstream tokenizer-based matcher.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for rerankers importer

Cover the RerankersImporter contract: interface metadata, preference
override, cross-encoder owner detection, case-insensitive 'reranker'
substring match (BAAI/bge-reranker, Alibaba-NLP/gte-reranker), URI
fallback, and the full-discovery ordering check that a BAAI reranker
repo must route to the rerankers importer rather than transformers.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add rerankers importer

Recognise reranker repositories — cross-encoder owner or any repo whose
name contains 'reranker' (case-insensitive). Emits backend: rerankers
with reranking: true and the rerank known_usecase.

Registered ahead of sentencetransformers and transformers so reranker
repos that happen to ship tokenizer.json or modules.json still route
here.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for sentencetransformers importer

Cover the SentenceTransformersImporter contract: interface metadata,
preference override, modules.json marker file, sentence_bert_config.json
marker file, sentence-transformers owner, URI fallback, and the
full-discovery ordering check that ensures a sentence-transformers HF
URI routes here rather than transformers.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add sentencetransformers importer

Recognise sentence-transformers embedding repos by modules.json,
sentence_bert_config.json, or the sentence-transformers owner. Emits
backend: sentencetransformers with embeddings: true and the embeddings
known_usecase.

Registered ahead of transformers so ST repos that carry tokenizer.json
still route here.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): add failing tests for rfdetr importer

Cover the RFDetrImporter contract: interface metadata, preference
override, case-insensitive rf-detr and rfdetr substring matches, URI
fallback, and negative cases. Implementation follows in the next
commit.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(gallery/importers): add rfdetr importer

Recognise RF-DETR object-detection repositories by a case-insensitive
'rf-detr' / 'rfdetr' substring in the repo name. Emits backend: rfdetr
with the detection known_usecase.

Registered ahead of transformers so RF-DETR repos with tokenizer
artefacts still route here.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(gallery/importers): surface ErrAmbiguousImport on sentence-similarity misses

Add an ambiguity fixture covering the embeddings/rerankers modality.
Qdrant/bm25 carries pipeline_tag=sentence-similarity but ships only
config.json + stopword .txt files — none of the Batch 5 importers
(silero-vad, rerankers, sentencetransformers, rfdetr) or the generic
vllm/transformers/llama-cpp/mlx/diffusers importers match. Because the
modality is in the ambiguous whitelist, DiscoverModelConfig must
surface ErrAmbiguousImport.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(localai/backend): red tests for KnownBackend.Installed flag

Extend the /backends/known suite with three failing cases that pin down
the forthcoming Installed field: JSON field presence on every entry,
flipping to true when an importer-registered backend is also present on
disk (and staying false for non-installed pref-only entries), and
surfacing system-only backends with empty modality and AutoDetect=false.

A small writeFakeSystemBackend helper plants a run.sh under the backends
dir so gallery.ListSystemBackends recognises the fixture.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(schema,localai/backend): add Installed flag to KnownBackend

Add an Installed bool to schema.KnownBackend and populate it from the
/backends/known handler so the React import form can warn users that
picking a not-yet-installed backend will trigger an automatic download
on submit.

Computation: after merging the importer registry, additional backends
provider entries and the curated pref-only slice, the handler walks
gallery.ListSystemBackends(systemState) and either flips the existing
map entry's Installed flag to true (preserving modality / autodetect /
description metadata) or inserts a bare {Installed:true} entry for
system-only backends the importer layer doesn't know about.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(localai/import_model): structured ambiguous-import response

Add red tests covering the extended ambiguity shape the React import
form needs:

- ImportModelURIEndpoint must return an HTTP 400 body that exposes the
  detected `modality` (normalised to the importer modality key, e.g.
  "tts" for pipeline_tag=text-to-speech) and a list of `candidates`
  (backend names filtered by modality, excluding text-LLM backends).
- The importers package must surface a typed AmbiguousImportError so
  HTTP consumers can read Modality + Candidates without parsing the
  error string. errors.Is against the existing sentinel keeps working.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(localai/import_model): structured ambiguity response with modality + candidates

DiscoverModelConfig now returns a typed AmbiguousImportError that
carries the importer modality key, candidate backend names, the
original URI, and the raw HF pipeline_tag. Its Is() preserves
errors.Is(err, ErrAmbiguousImport) for legacy callers.

The importer modality is pre-mapped from the HF pipeline_tag
(automatic-speech-recognition → asr, text-to-speech → tts, etc) via
PipelineTagToModality — surfaced as an exported helper so downstream
consumers can avoid duplicating the table. CandidatesForModality
filters the default importer registry plus AdditionalBackendsProvider
drop-ins by modality, sorts deterministically, and is the single
source of truth used by ImportModelURIEndpoint.

ImportModelURIEndpoint now returns HTTP 400 with
  { error, detail, modality, candidates, hint }
when ambiguity fires, letting the React form render a modality-scoped
picker inline instead of a generic toast.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(ui/import): manual pick badge + tooltip

Red Playwright coverage for the preference-only → manual pick rename:

- The Backend dropdown renders a "manual pick" badge on every option
  whose KnownBackend.auto_detect is false.
- The badge carries a title attribute with hover-tooltip copy that
  explains auto-detect won't route to this backend.
- Auto-detectable backends must NOT carry the badge.
- The legacy " (preference-only)" suffix is gone from every label.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* ui(import): replace preference-only suffix with manual pick badge

SearchableSelect option rows now support an optional badge field — a
muted pill rendered to the right of the label with an optional title
attribute for native hover tooltips. Plain text so screen readers read
it alongside the option name.

buildBackendOptions in ImportModel stops appending " (preference-only)"
to the label and instead sets badge="manual pick" plus a descriptive
tooltip on every option whose auto_detect is false. The Backend help
text explains what "manual pick" means so users aren't left wondering
about the badge.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(ui/import): inline ambiguity picker

Red Playwright coverage for Batch A2 — when the server returns a 400
ambiguity body, the form must render an inline alert instead of a
toast, expose one clickable chip per candidate backend, and support
both auto-resubmit on pick and silent dismiss.

- Mocks /api/models/import-uri with the structured ambiguity body
  (error, detail, modality, candidates, hint).
- On first click of Import, the alert is visible, carries
  modality-specific copy, and shows a chip per candidate.
- Clicking a chip clears the alert, sets the Backend dropdown, and
  triggers a second POST to /api/models/import-uri.
- Dismissing the alert leaves the Backend dropdown on Auto-detect —
  no implicit backend assignment.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(ui/import): inline ambiguity alert with candidate chips

Adds AmbiguityAlert — a soft, info-coloured card rendered above the URI
input when the server returns a structured 400 with { modality,
candidates }. Message is modality-aware (tts/asr/embeddings/image/
reranker/detection get purpose-written copy, everything else falls back
to a generic template). Each candidate is a clickable chip that shows a
download icon when /backends/known marks the backend as not yet
installed, so users aren't surprised by an implicit install.

ImportModel wires the alert to handleSimpleImport's error path:
- api.handleResponse now attaches { status, body } to the thrown Error
  so pages can pattern-match on structured responses instead of string
  error messages.
- handleSimpleImport detects `status === 400 && body.error === 'ambiguous
  import'` and flips into the inline-picker mode instead of toasting.
- Clicking a chip sets prefs.backend and auto-resubmits (passing the
  picked backend as an override so setPrefs's asynchrony doesn't leak
  a stale value).
- Dismissing clears the alert; changing the URI or the backend also
  clears it so a stale alert never sticks around.

Test fixtures mock GET /backends/known + POST /models/import-uri so the
Playwright specs don't depend on real network reachability.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(ui/import): auto-install warning

Red Playwright coverage for Batch A3 — when the user picks a backend
whose KnownBackend.installed is false, the form must render a muted
inline note under the Backend dropdown warning that submitting will
download the backend first. Picking an installed backend or leaving
Auto-detect selected must keep the note hidden.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(ui/import): auto-install warning under backend dropdown

When the user picks a backend whose KnownBackend.installed is false,
render a muted inline note under the Backend dropdown's help text
warning that submitting will download the backend first. The note
lives inside the same form-group so it lines up with the existing
hint text; it's hidden when Auto-detect is selected (the selected
backend is unknowable at that point) or when the chosen backend is
already on disk.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* ui(import): drop redundant section header, adjust icons, rename HF shortcut

- Remove the "Import from URI" card-level <h2> — the page title already
  says "Import New Model" one row up, so the secondary header was
  duplicating information.
- Swap the fa-star on "Common Preferences" for fa-sliders (stars imply
  favourites/ratings; this is just a preferences block) and move the
  Custom Preferences fa-sliders-h to fa-plus-circle so the two blocks
  read as distinct rather than as two sliders.
- Rename the HF shortcut from "Search GGUF on HF" → "Browse models on
  HF" and drop the `search=gguf` filter on the linked URL. The import
  form now supports ~40 backends; hard-coding GGUF in the copy no
  longer matches the form's actual reach.
- Pure polish — no behaviour change, covered by the existing Batch A
  Playwright suite.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(ui/import): batch B — simple/power switch, options, tabs, dialog

Adds a failing Playwright suite covering the full Batch B surface ahead
of implementation:

- B1: SimplePowerSwitch segmented control renders, toggles, persists to
  localStorage across reloads.
- B2: Simple-mode Options disclosure is collapsed by default; expanding
  exposes only Backend, Model Name, Description (no quantizations,
  mmproj, model type, or custom prefs).
- B3: Power mode has Preferences and YAML tabs with a persistent
  selection across reloads; URI/name/description typed in Simple carry
  over to Power; YAML tab swaps the primary action to Create.
- B4: Switching Power -> Simple with a custom preference set triggers
  the 3-button confirmation dialog (Keep / Discard / Cancel) with the
  documented semantics.

Tests fail against master — implementation lands in the following
commits.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(ui/import): add SimplePowerSwitch segmented control

Replaces the previous "Advanced Mode / Simple Mode" toggle button in the
page header with a two-segment control that flips between Simple and
Power. The control reuses the existing .segmented CSS shared with the
Sound page for visual consistency.

Mode state is persisted to localStorage under `import-form-mode` so
reloads land on the same view (default: simple). The boolean alias
`isAdvancedMode` is retained internally to minimise diff — subsequent
commits reshape the Simple and Power surfaces independently.

Closes B1 from the Batch B Playwright suite.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(ui/import): simple mode collapsible options, power tabs, switch dialog

Completes the Batch B surface in a single structural pass so Simple and
Power mode can evolve independently:

Simple mode
  - URI input + Ambiguity alert + Import button, plus a collapsible
    "Options" disclosure that exposes ONLY Backend, Model Name,
    Description. Quantizations / MMProj / Model Type / Diffusers fields
    / Custom Preferences are no longer rendered in Simple mode.

Power mode
  - In-page segmented "Preferences · YAML" tab strip. Active tab
    persists to localStorage under `import-form-power-tab`.
  - Preferences tab = the full existing preferences + custom prefs
    panel (no progressive disclosure yet — that's Batch D).
  - YAML tab = the existing CodeEditor. Primary button reads "Create"
    here, "Import Model" everywhere else.

Switch dialog
  - Power -> Simple with non-default prefs (advanced pref keys set,
    any custom-pref key non-empty, or YAML edited away from the
    template) opens a 3-button dialog: Keep & switch / Discard &
    switch / Cancel.
  - Keep preserves all state. Discard resets prefs + customPrefs + YAML
    to defaults. Cancel leaves the user in Power mode.

Page subtitle reflects the current surface (Simple, Power/Preferences,
Power/YAML). Estimate banner renders everywhere except Power/YAML.

Closes B2/B3/B4 from the Batch B Playwright suite.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(ui/import): expand Options disclosure in Batch A tests

Batch B hid the Backend dropdown behind a collapsible Options disclosure
in Simple mode. The Batch A tests that exercise the dropdown directly
(manual-pick badge, ambiguity chip sets the selected backend, auto-
install warning) now click the disclosure toggle before asserting on
dropdown contents. Test intent is unchanged.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* ui(import): strip decorative icons from field labels

The preference panel had 12 Font Awesome icons decorating field labels
(Backend, Model Name, Description, Quantizations, MMProj Quantizations,
Model Type, Pipeline Type, Scheduler Type, Enable Parameters, Embeddings,
CUDA, plus fa-link on Model URI). Every label screamed equally, flattening
the visual hierarchy.

Remove them. Keep icons where they carry meaning: page-level section
headers, URI format guide entries, primary buttons, the Simple-mode
Options disclosure, the ambiguity alert's fa-lightbulb, the auto-install
note's fa-download, and the Estimated-requirements banner's
fa-memory / fa-microchip / fa-download.

No new behaviour, no layout / spacing changes beyond removing the
orphaned icon margin. Playwright suite green.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(ui/import): progressive disclosure of preference fields

Cover the Batch D visibility matrix for Power > Preferences: Quantizations,
MMProj Quantizations, and Model Type each render only for the backends that
can consume them, stay visible when the backend is unset, and preserve any
value the user already typed when toggled off and back on. Also pin the
shrunk Description textarea at rows=2.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(ui/import): progressive disclosure + shorter description textarea

Gate Quantizations, MMProj Quantizations, and Model Type in the Power >
Preferences tab so each field only renders for the backends that can
actually consume it. Backend unset keeps everything visible. Hidden
fields' state is preserved (the JSX wrapper is guarded, not the
underlying prefs state) so users flipping backends back and forth don't
lose input.

Also shrink the Description textarea from rows=3 to rows=2 — it's
shared between Simple Options and Power Preferences so the change
applies to both.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(ui/import): enter-to-submit in Simple mode

Red test for Batch F3 — pressing Enter in the URI input must POST
/models/import-uri, and Enter in the Description textarea must insert
a newline without submitting the form.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(ui/import): enter-to-submit in Simple mode

Wrap the Simple-mode URI input + ambiguity alert + Options disclosure
in a <form> whose onSubmit calls handleSimpleImport. Pressing Enter in
the URI input (or any Simple-mode text input) now submits the import
without having to move the mouse to the header button. The Description
textarea keeps its native behaviour — Enter inserts a newline.

A hidden submit button is included because the visible Import button
lives outside the form in the page header; some browsers only fire
implicit Enter-submit when the form contains a submit-capable element.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* ui(import,SearchableSelect,components): aria-hidden on decorative icons

Every Font Awesome icon in the import form is decorative — its meaning
is already conveyed by adjacent visible text. Adding aria-hidden="true"
prevents screen readers from announcing the unicode glyph point as
content. Covers ImportModel.jsx (all remaining <i> glyphs) and
SearchableSelect.jsx (the trigger chevron).

AmbiguityAlert and SimplePowerSwitch already set aria-hidden on their
icons when the components landed in Batches A and B — no change needed
there.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* ui(SearchableSelect): responsive dropdown maxHeight + hover focus guard

F2 — replace fixed pixel heights with min(pixel, vh) so the dropdown
and its inner scroll region don't overflow short viewports. Outer
container: 260px -> min(260px, 60vh); inner listbox: 200px ->
min(200px, 50vh). Tall viewports still get the original pixel caps.

F5 — short-circuit onMouseEnter when the hovered row is already the
focused row. Avoids queueing a setFocusIndex call (and a render) for
every mousemove inside the same item — the state would be identical.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* ui(import): aria-label on custom preference rows

The Key / Value inputs and trash button in each Custom Preferences row
previously relied on placeholder text alone. Placeholders are not
accessible names — they vanish on input and screen readers do not
announce them consistently. Add row-indexed aria-labels so assistive
tech can distinguish "Preference key for row 1" from "row 2", and give
the trash button an explicit "Remove this preference" label.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* test(ui/import): modality chip row

Red tests for Batch E — a horizontal modality chip row that filters the
Backend dropdown by modality. Covers visibility in Simple-mode Options
and Power/Preferences (and absence in Power/YAML), filter behaviour,
mismatched-backend clearing with toast, ambiguity-alert auto-selection,
and radiogroup keyboard navigation.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* feat(ui/import): add ModalityChips component + filter integration

Horizontal chip row (Any, Text, Speech, TTS, Image, Embeddings,
Rerankers, Detection, VAD) filters the Backend dropdown options to the
selected modality. Default is Any — no filter, current behaviour.

- New ModalityChips component (radiogroup pattern, roving tabindex,
  arrow-key navigation, Home/End).
- buildBackendOptions now accepts an optional modalityFilter so grouped
  output is narrowed before rendering.
- Chips render inside Simple-mode Options disclosure and Power >
  Preferences tab. Power > YAML stays unaffected.
- Switching the filter drops a mismatched backend selection and
  surfaces a toast so the auto-clear is visible.
- Ambiguity alerts auto-activate the matching chip so users see only
  relevant backends even if they dismiss the alert.

Tightens the Batch E tests' option-matching to the label <span> so the
"↵" keybind hint on the focused row doesn't break accessible-name
lookups.

Assisted-by: Claude:claude-opus-4-7[1m] [Agent]

* fix(ui/import): rename Power to Advanced + stop URI-formats toggle from submitting form

The "Supported URI Formats" disclosure button inside the Simple-mode form
lacked an explicit type attribute, so it defaulted to type="submit". Every
click triggered the form's onSubmit and surfaced the empty-URI validation
toast ("Please enter a model URI"). Marking it type="button" lets it
behave as a pure toggle.

While here, rename the user-visible "Power" label to "Advanced" in the
mode switch (button text + tooltip) and the Power-mode tab's aria-label,
matching the term users actually expect. The internal mode key stays
'power' so tests, localStorage, and data-testid selectors are untouched.

Assisted-by: Claude:claude-opus-4-7

* fix(system): fall back to cpu when meta backend lacks default capability

Meta backends like vllm and sglang enumerate concrete variants for
nvidia/amd/intel/cpu but omit a default: catch-all entry. On a no-GPU
host the reported capability is "default", so the previous Capability()
returned "default" unconditionally on a miss — IsCompatibleWith then saw
no "default" key and filtered the meta out of AvailableBackends. The
import flow's auto-install step then failed with "no backend found with
name <meta>", contradicting the UI's promise that the backend would be
downloaded on demand.

Try the explicit "default" key first, then fall back to "cpu" before
giving up. vllm now resolves to cpu-vllm on CPU-only Linux without
touching the gallery YAML.

Assisted-by: Claude:claude-opus-4-7
2026-04-22 22:42:37 +02:00
orbisai0security
bbeacf140d fix: remove unsafe sprintf() in grpc-server.cpp (#9486)
fix: V-001 security vulnerability

Automated security fix generated by Orbis Security AI
2026-04-22 21:57:29 +02:00
LocalAI [bot]
6820ec468f chore(model gallery): 🤖 add 1 new models via gallery agent (#9491)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-22 21:56:11 +02:00
Ettore Di Giacinto
20baec77ab feat(face-recognition): add insightface/onnx backend for 1:1 verify, 1:N identify, embedding, detection, analysis (#9480)
* feat(face-recognition): add insightface backend for 1:1 verify, 1:N identify, embedding, detection, analysis

Adds face recognition as a new first-class capability in LocalAI via the
`insightface` Python backend, with a pluggable two-engine design so
non-commercial (insightface model packs) and commercial-safe
(OpenCV Zoo YuNet + SFace) models share the same gRPC/HTTP surface.

New gRPC RPCs (backend/backend.proto):
  * FaceVerify(FaceVerifyRequest) returns FaceVerifyResponse
  * FaceAnalyze(FaceAnalyzeRequest) returns FaceAnalyzeResponse

Existing Embedding and Detect RPCs are reused (face image in
PredictOptions.Images / DetectOptions.src) for face embedding and
face detection respectively.

New HTTP endpoints under /v1/face/:
  * verify     — 1:1 image pair same-person decision
  * analyze    — per-face age + gender (emotion/race reserved)
  * register   — 1:N enrollment; stores embedding in vector store
  * identify   — 1:N recognition; detect → embed → StoresFind
  * forget     — remove a registered face by opaque ID

Service layer (core/services/facerecognition/) introduces a
`Registry` interface with one in-memory `storeRegistry` impl backed
by LocalAI's existing local-store gRPC vector backend. HTTP handlers
depend on the interface, not on StoresSet/StoresFind directly, so a
persistent PostgreSQL/pgvector implementation can be slotted in via a
single constructor change in core/application (TODO marker in the
package doc).

New usecase flag FLAG_FACE_RECOGNITION; insightface is also wired
into FLAG_DETECTION so /v1/detection works for face bounding boxes.

Gallery (backend/index.yaml) ships three entries:
  * insightface-buffalo-l   — SCRFD-10GF + ArcFace R50 + genderage
                              (~326MB pre-baked; non-commercial research use only)
  * insightface-opencv      — YuNet + SFace (~40MB pre-baked; Apache 2.0)
  * insightface-buffalo-s   — SCRFD-500MF + MBF (runtime download; non-commercial)

Python backend (backend/python/insightface/):
  * engines.py — FaceEngine protocol with InsightFaceEngine and
    OnnxDirectEngine; resolves model paths relative to the backend
    directory so the same gallery config works in docker-scratch and
    in the e2e-backends rootfs-extraction harness.
  * backend.py — gRPC servicer implementing Health, LoadModel, Status,
    Embedding, Detect, FaceVerify, FaceAnalyze.
  * install.sh — pre-bakes buffalo_l + OpenCV YuNet/SFace inside the
    backend directory so first-run is offline-clean (the final scratch
    image only preserves files under /<backend>/).
  * test.py — parametrized unit tests over both engines.

Tests:
  * Registry unit tests (go test -race ./core/services/facerecognition/...)
    — in-memory fake grpc.Backend, table-driven, covers register/
    identify/forget/error paths + concurrent access.
  * tests/e2e-backends/backend_test.go extended with face caps
    (face_detect, face_embed, face_verify, face_analyze); relative
    ordering + configurable verifyCeiling per engine.
  * Makefile targets: test-extra-backend-insightface-buffalo-l,
    -opencv, and the -all aggregate.
  * CI: .github/workflows/test-extra.yml gains tests-insightface-grpc,
    auto-triggered by changes under backend/python/insightface/.

Docs:
  * docs/content/features/face-recognition.md — feature page with
    license table, quickstart (defaults to the commercial-safe model),
    models matrix, API reference, 1:N workflow, storage caveats.
  * Cross-refs in object-detection.md, stores.md, embeddings.md, and
    whats-new.md.
  * Contributor README at backend/python/insightface/README.md.

Verified end-to-end:
  * buffalo_l: 6/6 specs (health, load, face_detect, face_embed,
    face_verify, face_analyze).
  * opencv: 5/5 specs (same minus face_analyze — SFace has no
    demographic head; correctly skipped via BACKEND_TEST_CAPS).

Assisted-by: Claude:claude-opus-4-7

* fix(face-recognition): move engine selection to model gallery, collapse backend entries

The previous commit put engine/model_pack options on backend gallery
entries (`backend/index.yaml`). That was wrong — `GalleryBackend`
(core/gallery/backend_types.go:32) has no `options` field, so the
YAML decoder silently dropped those keys and all three "different
insightface-*" backend entries resolved to the same container image
with no distinguishing configuration.

Correct split:

  * `backend/index.yaml` now has ONE `insightface` backend entry
    shipping the CPU + CUDA 12 container images. The Python backend
    bundles both the non-commercial insightface model packs
    (buffalo_l / buffalo_s) and the commercial-safe OpenCV Zoo
    weights (YuNet + SFace); the active engine is selected at
    LoadModel time via `options: ["engine:..."]`.

  * `gallery/index.yaml` gains three model entries —
    `insightface-buffalo-l`, `insightface-opencv`,
    `insightface-buffalo-s` — each setting the appropriate
    `overrides.backend` + `overrides.options` so installing one
    actually gives the user the intended engine. This matches how
    `rfdetr-base` lives in the model gallery against the `rfdetr`
    backend.

The earlier e2e tests passed despite this bug because the Makefile
targets pass `BACKEND_TEST_OPTIONS` directly to LoadModel via gRPC,
bypassing any gallery resolution entirely. No code changes needed.

Assisted-by: Claude:claude-opus-4-7

* feat(face-recognition): cover all supported models in the gallery + drop weight baking

Follows up on the model-gallery split: adds entries for every model
configuration either engine actually supports, and switches weight
delivery from image-baked to LocalAI's standard gallery mechanism.

Gallery now has seven `insightface-*` model entries (gallery/index.yaml):

  insightface (family)  — non-commercial research use
    • buffalo-l   (326MB)  — SCRFD-10GF + ResNet50 + genderage, default
    • buffalo-m   (313MB)  — SCRFD-2.5GF + ResNet50 + genderage
    • buffalo-s   (159MB)  — SCRFD-500MF + MBF + genderage
    • buffalo-sc  (16MB)   — SCRFD-500MF + MBF, recognition only
                             (no landmarks, no demographics — analyze
                             returns empty attributes)
    • antelopev2  (407MB)  — SCRFD-10GF + ResNet100@Glint360K + genderage

  OpenCV Zoo family — Apache 2.0 commercial-safe
    • opencv       — YuNet + SFace fp32 (~40MB)
    • opencv-int8  — YuNet + SFace int8 (~12MB, ~3x smaller, faster on CPU)

Model weights are no longer baked into the backend image. The image
now ships only the Python runtime + libraries (~275MB content size,
~1.18GB disk vs ~1.21GB when weights were baked). Weights flow through
LocalAI's gallery mechanism:

  * OpenCV variants list `files:` with ONNX URIs + SHA-256, so
    `local-ai models install insightface-opencv` pulls them into the
    models directory exactly like any other gallery-managed model.

  * insightface packs (upstream distributes .zip archives only, not
    individual ONNX files) auto-download on first LoadModel via
    FaceAnalysis' built-in machinery, rooted at the LocalAI models
    directory so they live alongside everything else — same pattern
    `rfdetr` uses with `inference.get_model()`.

Backend changes (backend/python/insightface/):

  * backend.py — LoadModel propagates `ModelOptions.ModelPath` (the
    LocalAI models directory) to engines via a `_model_dir` hint.
    This replaces the earlier ModelFile-dirname approach; ModelPath
    is the canonical "models directory" variable set by the Go loader
    (pkg/model/initializers.go:144) and is always populated.

  * engines.py::_resolve_model_path — picks up `model_dir` and searches
    it (plus basename-in-model-dir) before falling back to the dev
    script-dir. This is how OnnxDirectEngine finds gallery-downloaded
    YuNet/SFace files by filename only.

  * engines.py::_flatten_insightface_pack — new helper that works
    around an upstream packaging inconsistency: buffalo_l/s/sc zips
    expand flat, but buffalo_m and antelopev2 zips wrap their ONNX
    files in a redundant `<name>/` directory. insightface's own
    loader looks one level too shallow and fails. We call
    `ensure_available()` explicitly, flatten if nested, then hand to
    FaceAnalysis.

  * engines.py::InsightFaceEngine.prepare — root-resolution order now
    includes the `_model_dir` hint so packs download into the LocalAI
    models directory by default.

  * install.sh — no longer pre-downloads any weights. Everything is
    gallery-managed now.

  * smoke.py (new) — parametrized smoke test that iterates over every
    gallery configuration, simulating the LocalAI install flow
    (creates a models dir, fetches OpenCV files with checksum
    verification, lets insightface auto-download its packs), then
    runs detect + embed + verify (+ analyze where supported) through
    the in-process BackendServicer.

  * test.py — OnnxDirectEngineTest no longer hardcodes `/models/opencv/`
    paths; downloads ONNX files to a temp dir at setUpClass time and
    passes ModelPath accordingly.

Registry change (core/services/facerecognition/store_registry.go):

  * `dim=0` in NewStoreRegistry now means "accept whatever dimension
    arrives" — needed because the backend supports 512-d ArcFace/MBF
    and 128-d SFace via the same Registry. A non-zero dim still fails
    fast with ErrDimensionMismatch.

  * core/application plumbs `faceEmbeddingDim = 0`, explaining the
    rationale in the comment.

Backend gallery description updated to reflect that the image carries
no weights — it's just Python + engines.

Smoke-tested all 7 configurations against the rebuilt image (with the
flatten fix applied), exit 0:

    PASS: insightface-buffalo-l    faces=6 dim=512 same-dist=0.000
    PASS: insightface-buffalo-sc   faces=6 dim=512 same-dist=0.000
    PASS: insightface-buffalo-s    faces=6 dim=512 same-dist=0.000
    PASS: insightface-buffalo-m    faces=6 dim=512 same-dist=0.000
    PASS: insightface-antelopev2   faces=6 dim=512 same-dist=0.000
    PASS: insightface-opencv       faces=6 dim=128 same-dist=0.000
    PASS: insightface-opencv-int8  faces=6 dim=128 same-dist=0.000
    7/7 passed

Assisted-by: Claude:claude-opus-4-7

* fix(face-recognition): pre-fetch OpenCV ONNX for e2e target; drop stale pre-baked claim

CI regression from the previous commit: I moved OpenCV Zoo weight
delivery to LocalAI's gallery `files:` mechanism, but the
test-extra-backend-insightface-opencv target was still passing
relative paths `detector_onnx:models/opencv/yunet.onnx` in
BACKEND_TEST_OPTIONS. The e2e suite drives LoadModel directly over
gRPC without going through the gallery, so those relative paths
resolved to nothing and OpenCV's ONNXImporter failed:

    LoadModel failed: Failed to load face engine:
    OpenCV(4.13.0) ... Can't read ONNX file: models/opencv/yunet.onnx

Fix: add an `insightface-opencv-models` prerequisite target that
fetches the two ONNX files (YuNet + SFace) to a deterministic host
cache at /tmp/localai-insightface-opencv-cache/, verifies SHA-256,
and skips the download on re-runs. The opencv test target depends on
it and passes absolute paths in BACKEND_TEST_OPTIONS, so the backend
finds the files via its normal absolute-path resolution branch.

Also refresh the buffalo_l comment: it no longer says "pre-baked"
(nothing is — the pack auto-downloads from upstream's GitHub release
on first LoadModel, same as in CI).

Locally verified: `make test-extra-backend-insightface-opencv` passes
5/5 specs (health, load, face_detect, face_embed, face_verify).

Assisted-by: Claude:claude-opus-4-7

* feat(face-recognition): add POST /v1/face/embed + correct /v1/embeddings docs

The docs promised that /v1/embeddings returns face vectors when you
send an image data-URI. That was never true: /v1/embeddings is
OpenAI-compatible and text-only by contract — its handler goes
through `core/backend/embeddings.go::ModelEmbedding`, which sets
`predictOptions.Embeddings = s` (a string of TEXT to embed) and never
populates `predictOptions.Images[]`. The Python backend's Embedding
gRPC method does handle Images[] (that's how /v1/face/register reaches
it internally via `backend.FaceEmbed`), but the HTTP embeddings
endpoint wasn't wired to populate it.

Rather than overload /v1/embeddings with image-vs-text detection —
messy, and the endpoint is OpenAI-compatible by design — add a
dedicated /v1/face/embed endpoint that wraps `backend.FaceEmbed`
(already used internally by /v1/face/register and /v1/face/identify).

Matches LocalAI's convention of a dedicated path per non-standard flow
(/v1/rerank, /v1/detection, /v1/face/verify etc.).

Response:

    {
      "embedding": [<dim> floats, L2-normed],
      "dim": int,           // 512 for ArcFace R50 / MBF, 128 for SFace
      "model": "<name>"
    }

Live-tested on the opencv engine: returns a 128-d L2-normalized vector
(sum(x^2) = 1.0000). Sentinel in docs updated to note /v1/embeddings
is text-only and point image users at /v1/face/embed instead.

Assisted-by: Claude:claude-opus-4-7

* fix(http): map malformed image input + gRPC status codes to proper 4xx

Image-input failures on LocalAI's single-image endpoints (/v1/detection,
/v1/face/{verify,analyze,embed,register,identify}) have historically
returned 500 — even when the client was the one who sent garbage.
Classic example: you POST an "image" that isn't a URL, isn't a
data-URI, and isn't a valid JPEG/PNG — the server shouldn't claim
that's its fault.

Two helpers land in core/http/endpoints/localai/images.go and every
single-image handler is switched over:

  * decodeImageInput(s)
      Wraps utils.GetContentURIAsBase64 and turns any failure
      (invalid URL, not a data-URI, download error, etc.) into
      echo.NewHTTPError(400, "invalid image input: ...").

  * mapBackendError(err)
      Inspects the gRPC status on a backend call error and maps:
        INVALID_ARGUMENT     → 400 Bad Request
        NOT_FOUND            → 404 Not Found
        FAILED_PRECONDITION  → 412 Precondition Failed
        Unimplemented        → 501 Not Implemented
      All other codes fall through unchanged (still 500).

Before, my 1×1 PNG error-path test returned:
    HTTP 500 "rpc error: code = InvalidArgument desc = failed to decode one or both images"
After:
    HTTP 400 "failed to decode one or both images"

Scope-limited to the LocalAI single-image endpoints. The multi-modal
paths (middleware/request.go, openresponses/responses.go,
openai/realtime.go) intentionally log-and-skip individual media parts
when decoding fails — different design intent (graceful degradation
of a multi-part message), not a 400-worthy failure. Left untouched.

Live-verified: every error case in /tmp/face_errors.py now returns
4xx with a meaningful message; the "image with no face (1x1 PNG)"
case specifically went from 500 → 400.

Assisted-by: Claude:claude-opus-4-7

* refactor(face-recognition): insightface packs go through gallery files:, drop FaceAnalysis

Follows up on the discovery that LocalAI's gallery `files:` mechanism
handles archives (zip, tar.gz, …) via mholt/archiver/v3 — the rhasspy
piper voices use exactly this pattern. Insightface packs are zip
archives, so we can now deliver them the same way every other
gallery-managed model gets delivered: declaratively, checksum-verified,
through LocalAI's standard download+extract pipeline.

Two changes:

1. Gallery (gallery/index.yaml) — every insightface-* entry gains a
   `files:` list with the pack zip's URI + SHA-256. `local-ai models
   install insightface-buffalo-l` now fetches the zip, verifies the
   hash, and extracts it into the models directory. No more reliance
   on insightface's library-internal `ensure_available()` auto-download
   or its hardcoded `BASE_REPO_URL`.

2. InsightFaceEngine (backend/python/insightface/engines.py) — drops
   the FaceAnalysis wrapper and drives insightface's `model_zoo`
   directly. The ~50 lines FaceAnalysis provides — glob ONNX files,
   route each through `model_zoo.get_model()`, build a
   `{taskname: model}` dict, loop per-face at inference — are
   reimplemented in `InsightFaceEngine`. The actual inference classes
   (RetinaFace, ArcFaceONNX, Attribute, Landmark) are still
   insightface's — we only replicate the glue, so drift risk against
   upstream is minimal.

   Why drop FaceAnalysis: it hard-codes a `<root>/models/<name>/*.onnx`
   layout that doesn't match what LocalAI's zip extraction produces.
   LocalAI unpacks archives flat into `<models_dir>`. Upstream packs
   are inconsistent — buffalo_l/s/sc ship ONNX at the zip root (lands
   at `<models_dir>/*.onnx`), buffalo_m/antelopev2 wrap in a redundant
   `<name>/` dir (lands at `<models_dir>/<name>/*.onnx`). The new
   `_locate_insightface_pack` helper searches both locations plus
   legacy paths and returns whichever has ONNX files. Replaces the
   earlier `_flatten_insightface_pack` helper (which tried to fight
   FaceAnalysis's layout expectations; now we just find the files
   wherever they are).

Net effect for users: install once via LocalAI's managed flow,
weights live alongside every other model, progress shows in the
jobs endpoint, no first-load network call. Same API surface,
cleaner plumbing.

Assisted-by: Claude:claude-opus-4-7

* fix(face-recognition): CI's insightface e2e path needs the pack pre-fetched

The e2e suite drives LoadModel over gRPC without going through LocalAI's
gallery flow, so the engine's `_model_dir` option (normally populated
from ModelPath) is empty. Previously the insightface target relied on
FaceAnalysis auto-download to paper over this, but we dropped
FaceAnalysis in favor of direct model_zoo calls — so the buffalo_l
target started failing at LoadModel with "no insightface pack found".

Mirror the opencv target's pre-fetch pattern: download buffalo_sc.zip
(same SHA as the gallery entry), extract it on the host, and pass
`root:<dir>` so the engine locates the pack without needing
ModelPath. Switched to buffalo_sc (smallest pack, ~16MB) to keep CI
fast; it covers the same insightface engine code path as buffalo_l.

Face analyze cap dropped since buffalo_sc has no age/gender head.

Assisted-by: Claude:claude-opus-4-7[1m]

* feat(face-recognition): surface face-recognition in advertised feature maps

The six /v1/face/* endpoints were missing from every place LocalAI
advertises its feature surface to clients:

  * api_instructions — the machine-readable capability index at
    GET /api/instructions. Added `face-recognition` as a dedicated
    instruction area with an intro that calls out the in-memory
    registry caveat and the /v1/face/embed vs /v1/embeddings split.
  * auth/permissions — added FeatureFaceRecognition constant, routed
    all six face endpoints through it so admins can gate them per-user
    like any other API feature. Default ON (matches the other API
    features).
  * React UI capabilities — CAP_FACE_RECOGNITION symbol mapped to
    FLAG_FACE_RECOGNITION. Declared only for now; the Face page is a
    follow-up (noted in the plan).

Instruction count bumped 9 → 10; test updated.

Assisted-by: Claude:claude-opus-4-7[1m]

* docs(agents): capture advertising-surface steps in the endpoint guide

Before this change, adding a new /v1/* endpoint reliably missed one or
more of: the swagger @Tags annotation, the /api/instructions registry,
the auth RouteFeatureRegistry, and the React UI CAP_* symbol. The
endpoint would work but be invisible to API consumers, admins, and the
UI — and nothing in the existing docs said to look in those places.

Extend .agents/api-endpoints-and-auth.md with a new "Advertising
surfaces" section covering all four surfaces (swagger tags, /api/
instructions, capabilities.js, docs/), and expand the closing checklist
so it's impossible to ship a feature without visiting each one. Hoist a
one-liner reminder into AGENTS.md's Quick Reference so agents skim it
before diving in.

Assisted-by: Claude:claude-opus-4-7[1m]
2026-04-22 21:55:41 +02:00
Richard Palethorpe
d16f19f1eb fix(kokoros): Build and publish the backend images from CI/CD (#9487)
* fix(kokoros): Build and publish the backend images from CI/CD

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* Delete .claude/agents

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

* Delete .claude/commands

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

* Delete .claude/settings.json

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

* Delete .claude/skills

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-04-22 13:19:55 +02:00
902 changed files with 103091 additions and 16214 deletions

View File

@@ -8,6 +8,7 @@ Create the backend directory under the appropriate location:
- **Python backends**: `backend/python/<backend-name>/`
- **Go backends**: `backend/go/<backend-name>/`
- **C++ backends**: `backend/cpp/<backend-name>/`
- **Rust backends**: `backend/rust/<backend-name>/`
For Python backends, you'll typically need:
- `backend.py` - Main gRPC server implementation
@@ -18,9 +19,70 @@ For Python backends, you'll typically need:
- `run.sh` - Runtime script
- `test.py` / `test.sh` - Test files
## 2. Add Build Configurations to `.github/workflows/backend.yml`
For Rust backends, you'll typically need (see `backend/rust/kokoros/` as a reference):
- `Cargo.toml` - Crate manifest; depend on the upstream project as a submodule under `sources/`
- `build.rs` - Invokes `tonic_build` to generate gRPC stubs from `backend/backend.proto` (use the `BACKEND_PROTO_PATH` env var so the Makefile can inject the canonical copy)
- `src/` - The gRPC server implementation (implement `Backend` via `tonic`)
- `Makefile` - Copies `backend.proto` into the crate, runs `cargo build --release`, then `package.sh`
- `package.sh` - Uses `ldd` to bundle the binary's dynamic deps and `ld.so` into `package/lib/`
- `run.sh` - Sets `LD_LIBRARY_PATH`/`SSL_CERT_DIR` and execs the binary via the bundled `lib/ld.so`
- `sources/<UpstreamProject>/` - Git submodule with the upstream Rust crate
Add build matrix entries for each platform/GPU type you want to support. Look at similar backends (e.g., `chatterbox`, `faster-whisper`) for reference.
## 2. Add Build Configurations to `.github/backend-matrix.yml`
The build matrix is data-only YAML at `.github/backend-matrix.yml` (not inside `backend.yml` itself). `backend.yml` (master push) and `backend_pr.yml` (PR) load it via `scripts/changed-backends.js`, which also handles per-file path filtering so only touched backends rebuild on PRs and master pushes alike. Add build matrix entries to `.github/backend-matrix.yml` for each platform/GPU type you want to support. Look at similar backends for reference — `chatterbox`/`faster-whisper` for Python, `piper`/`silero-vad` for Go, `kokoros` for Rust.
**Without an entry here no image is ever built or pushed, and the gallery entry in `backend/index.yaml` will point at a tag that does not exist.** The `dockerfile:` field must point at `./backend/Dockerfile.<lang>` matching the language bucket from step 1 (e.g. `Dockerfile.python`, `Dockerfile.golang`, `Dockerfile.rust`). The `tag-suffix` must match the `uri:` in the corresponding `backend/index.yaml` image entry exactly.
**`scripts/changed-backends.js` registration — REQUIRED for any new dockerfile suffix.** This is the single most common omission, because it has no effect on the PR that adds the backend (when no prior path filter could catch it anyway) — it only breaks the *next* PR that touches your backend's directory, which then gets zero CI jobs and looks broken for unrelated reasons. Edit `scripts/changed-backends.js:inferBackendPath` and add a branch BEFORE the more-generic suffixes:
```js
if (item.dockerfile.endsWith("<your-dockerfile-suffix>")) {
return `backend/cpp/<your-backend>/`; // or backend/python|go|rust/...
}
```
The `endsWith()` test is against the matrix entry's `dockerfile:` value (e.g. `./backend/Dockerfile.ds4``endsWith("ds4")`). Specificity order matters here just like it does for importers: more-specific suffixes go BEFORE more-generic ones (e.g. `ds4` before `llama-cpp` even though both end with letters, because some upstream might one day call itself `super-ds4-llama-cpp`). Verify locally before pushing:
```bash
# Confirm your dockerfile suffix is unique enough
node -e "
const yaml = require('js-yaml'); const fs = require('fs');
const m = yaml.load(fs.readFileSync('.github/backend-matrix.yml','utf8'));
for (const e of m.include.filter(e => e.backend === '<your-backend>')) {
console.log(e.dockerfile, '->', e.dockerfile.endsWith('<suffix>'));
}"
```
A quick way to find the right insertion point: `grep -n 'item.dockerfile.endsWith' scripts/changed-backends.js`.
**`bump_deps.yaml` registration — REQUIRED for any backend pinning an upstream commit.** If your backend's Makefile has a `*_VERSION?=<sha>` pin to a third-party repo, the daily auto-bump bot at `.github/workflows/bump_deps.yaml` won't notice it unless you register the backend in its matrix. The bot runs `.github/bump_deps.sh` which `grep`s for `^$VAR?=` in the Makefile you list — so the pin MUST live in the Makefile (not in a separate shell script). The bump for ds4 (#9761) had to walk this back because the original landed the pin in `prepare.sh`, which the bot can't see. Pattern (for `antirez/ds4`):
```yaml
# .github/workflows/bump_deps.yaml
matrix:
include:
- repository: "antirez/ds4"
variable: "DS4_VERSION"
branch: "main"
file: "backend/cpp/ds4/Makefile"
```
And the corresponding Makefile shape (mirror `backend/cpp/llama-cpp/Makefile`):
```makefile
DS4_VERSION?=ae302c2fa18cc6d9aefc021d0f27ae03c9ad2fc0
DS4_REPO?=https://github.com/antirez/ds4
...
ds4:
mkdir -p ds4
cd ds4 && git init -q && \
git remote add origin $(DS4_REPO) && \
git fetch --depth 1 origin $(DS4_VERSION) && \
git checkout FETCH_HEAD
```
If you have a `prepare.sh` doing the clone, delete it — the recipe belongs in the Makefile target so `make purge && make` works as a clean-and-rebuild and so the bump bot finds the pin.
**Placement in file:**
- CPU builds: Add after other CPU builds (e.g., after `cpu-chatterbox`)
@@ -29,9 +91,17 @@ Add build matrix entries for each platform/GPU type you want to support. Look at
**Additional build types you may need:**
- ROCm/HIP: Use `build-type: 'hipblas'` with `base-image: "rocm/dev-ubuntu-24.04:7.2.1"`
- Intel/SYCL: Use `build-type: 'intel'` or `build-type: 'sycl_f16'`/`sycl_f32` with `base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"`
- Intel/SYCL: Use `build-type: 'intel'` or `build-type: 'sycl_f16'`/`sycl_f32` with `base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"`
- L4T (ARM): Use `build-type: 'l4t'` with `platforms: 'linux/arm64'` and `runs-on: 'ubuntu-24.04-arm'`
**Per-arch native builds (`linux/amd64` + `linux/arm64`):**
Multi-arch backends are NOT a single matrix entry with `platforms: 'linux/amd64,linux/arm64'`. Instead, add **two** entries — one with `platforms: 'linux/amd64'` + `platform-tag: 'amd64'` + `runs-on: 'ubuntu-latest'`, one with `platforms: 'linux/arm64'` + `platform-tag: 'arm64'` + `runs-on: 'ubuntu-24.04-arm'` — both sharing the same `tag-suffix`. The script detects the shared `tag-suffix` and emits a `merge-matrix` entry, so `backend-merge-jobs` (in `backend.yml`/`backend_pr.yml`) automatically assembles the manifest list from per-arch digest artifacts. See `-cpu-faster-whisper` in `.github/backend-matrix.yml` for a reference shape.
**llama-cpp / ik-llama-cpp / turboquant variants only — `builder-base-image`:**
Entries whose `dockerfile` is `./backend/Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}` must also set a `builder-base-image` field pointing at a prebuilt base from `quay.io/go-skynet/ci-cache:base-grpc-*` (CI builds these via `.github/workflows/base-images.yml`). The mapping is by `(build-type, platforms)` — see existing entries for the pattern. CI uses these prebuilt bases to skip the gRPC compile (~2535 min cold). Local `make backends/<name>` ignores `builder-base-image` and uses the from-source path inside the Dockerfile, so you don't need quay access for local builds.
## 3. Add Backend Metadata to `backend/index.yaml`
**Step 3a: Add Meta Definition**
@@ -42,6 +112,8 @@ Add a YAML anchor definition in the `## metas` section (around line 2-300). Look
Add image entries at the end of the file, following the pattern of similar backends such as `diffusers` or `chatterbox`. Include both `latest` (production) and `master` (development) tags.
**Note on integrity:** OCI backends installed from a gallery whose `verification:` block is set are verified against a keyless-cosign policy before extraction; tarball/HTTP backends use the optional `sha256:` field. New backends do not need any extra YAML — the gallery-level `verification:` block covers every entry. See [.agents/backend-signing.md](backend-signing.md) for the producer-side CI step.
## 4. Update the Makefile
The Makefile needs to be updated in several places to support building and testing the new backend:
@@ -56,24 +128,28 @@ Add `backends/<backend-name>` to the `.NOTPARALLEL` line (around line 2) to prev
**Step 4b: Add to `prepare-test-extra`**
Add the backend to the `prepare-test-extra` target (around line 312) to prepare it for testing:
Add the backend to the `prepare-test-extra` target to prepare it for testing. Use the path matching your language bucket (`backend/python/`, `backend/go/`, `backend/rust/`, …):
```makefile
prepare-test-extra: protogen-python
...
$(MAKE) -C backend/python/<backend-name>
$(MAKE) -C backend/<lang>/<backend-name>
```
For Rust backends the target is usually the crate build target itself (e.g. `$(MAKE) -C backend/rust/<backend-name> <backend-name>-grpc`) so the binary is in place before `test` runs.
**Step 4c: Add to `test-extra`**
Add the backend to the `test-extra` target (around line 319) to run its tests:
Add the backend to the `test-extra` target to run its tests — applies to Go and Rust backends too, not only Python:
```makefile
test-extra: prepare-test-extra
...
$(MAKE) -C backend/python/<backend-name> test
$(MAKE) -C backend/<lang>/<backend-name> test
```
Each backend's own `Makefile` should define a `test` target so this line works regardless of language. Integration tests that need large model downloads should be gated behind an env var (see `backend/rust/kokoros/`'s `KOKOROS_MODEL_PATH` pattern) so CI only runs unit tests.
**Step 4d: Add Backend Definition**
Add a backend definition variable in the backend definitions section (around line 428-457). The format depends on the backend type:
@@ -93,6 +169,13 @@ BACKEND_<BACKEND_NAME> = <backend-name>|python|./backend|false|true
BACKEND_<BACKEND_NAME> = <backend-name>|golang|.|false|true
```
**For Rust backends**:
```makefile
BACKEND_<BACKEND_NAME> = <backend-name>|rust|.|false|true
```
The language field (`python`/`golang`/`rust`/…) must match a `backend/Dockerfile.<lang>` file.
**Step 4e: Generate Docker Build Target**
Add an eval call to generate the docker-build target (around line 480-501):
@@ -120,7 +203,7 @@ docker-build-backends: ... docker-build-<backend-name>
After adding a new backend, verify:
- [ ] Backend directory structure is complete with all necessary files
- [ ] Build configurations added to `.github/workflows/backend.yml` for all desired platforms
- [ ] Build configurations added to `.github/backend-matrix.yml` for all desired platforms (per-arch entries with `platform-tag` for multi-arch; `builder-base-image` for llama-cpp / ik-llama-cpp / turboquant)
- [ ] Meta definition added to `backend/index.yaml` in the `## metas` section
- [ ] Image entries added to `backend/index.yaml` for all build variants (latest + development)
- [ ] Tag suffixes match between workflow file and index.yaml
@@ -153,6 +236,29 @@ ls /tmp/check # expect the bundled .so files + symlinks
Then boot it inside a fresh `ubuntu:24.04` (which intentionally does *not* have the lib installed) to confirm it actually loads from the backend dir.
## Importer integration
When you add a new backend, you MUST also make it importable via the model import form (`/import-model`). The import form dropdown is sourced dynamically from `GET /backends/known` — it reads the importer registry at `core/gallery/importers/importers.go`, so the steps below are the ONLY way to make your backend show up.
Required steps:
1. **If your backend has unambiguous detection signals** (unique file extension, HF `pipeline_tag`, unique repo name pattern, unique artefact like `modules.json`):
- Create an importer file at `core/gallery/importers/<backend>.go` following the Match/Import pattern in `llama-cpp.go`.
- Register it in `importers.go:defaultImporters` in **specificity order** — more specific detectors must appear BEFORE more generic ones (e.g. `sentencetransformers` before `transformers`, `stablediffusion-ggml` before `llama-cpp`, `vllm-omni` before `vllm`). First match wins.
2. **If your backend is a drop-in replacement** (same artefacts as another backend, e.g. `ik-llama-cpp` and `turboquant` both consume GGUF the same way `llama-cpp` does):
- Do NOT create a new importer. Extend the existing importer's `Import()` to swap the emitted `backend:` field when `preferences.backend` matches. See `llama-cpp.go` for the pattern.
3. **If your backend has no reliable auto-detect signal** (preference-only — e.g. `sglang`, `tinygrad`, `whisperx`):
- Do NOT create an importer. Instead add the backend name to the curated pref-only slice in `core/http/endpoints/localai/backend.go` that feeds `/backends/known`. A single line addition.
4. **Always** add a table-driven test in `core/gallery/importers/importers_test.go` (Ginkgo/Gomega):
- Use a real public HuggingFace repo URI as the test fixture (existing tests already hit the live HF API — follow that pattern).
- Cover detection (auto-match without preferences), preference-override (explicit `backend:` in preferences wins), and — if the backend's modality has a common `pipeline_tag` but ambiguous artefacts — an ambiguity test asserting `errors.Is(err, importers.ErrAmbiguousImport)`.
Rules of thumb:
- When in doubt, lean pref-only. A wrong auto-detect is worse than a forced preference.
- Never silently emit a modality mismatch (e.g. emit `llama-cpp` for a TTS repo because `.gguf` is present). Return `ErrAmbiguousImport` instead.
- Registration order is the single most common source of bugs. Check by running `go test ./core/gallery/importers/...` — the existing suite will fail if you've shadowed a pre-existing detector.
## 6. Example: Adding a Python Backend
For reference, when `moonshine` was added:

View File

@@ -2,6 +2,8 @@
This guide covers how to add new API endpoints and properly integrate them with the auth/permissions system.
> **Before you ship a new endpoint or capability surface**, re-read the [checklist at the bottom of this file](#checklist). LocalAI advertises its feature surface in several independent places — miss any one of them and clients/admins/UI won't know the endpoint exists.
## Architecture overview
Authentication and authorization flow through three layers:
@@ -234,6 +236,76 @@ Use these HTTP status codes:
If your endpoint should be tracked for usage (token counts, request counts), add the `usageMiddleware` to its middleware chain. See `core/http/middleware/usage.go` and how it's applied in `routes/openai.go`.
## Advertising surfaces — where to register a new capability
Beyond routing and auth, LocalAI publishes its capability surface in **four independent places**. When you add an endpoint — especially one introducing a net-new capability like a new media type or a new auth-gated feature — you must update every relevant surface. These aren't optional: missing them means the endpoint works but is invisible to clients, admins, and the UI.
### 1. Swagger `@Tags` annotation (mandatory)
Every handler needs a swagger block so the endpoint appears in `/swagger/index.html` and in the `/api/instructions` output. The `@Tags` value is what groups the endpoint into a capability area:
```go
// MyEndpoint does X.
// @Summary Do X.
// @Tags my-capability
// @Param request body schema.MyRequest true "payload"
// @Success 200 {object} schema.MyResponse "Response"
// @Router /v1/my-endpoint [post]
func MyEndpoint(...) echo.HandlerFunc { ... }
```
Use an existing tag when the endpoint extends an existing area (e.g. `audio`, `images`, `face-recognition`). Create a new tag only when the endpoint introduces a genuinely new capability surface — and in that case, also register it in step 2.
After adding endpoints, regenerate the embedded spec so the runtime serves it:
```bash
make protogen-go # ensures gRPC codegen is fresh first
make swagger # regenerates swagger/swagger.json
```
### 2. `/api/instructions` registry (for new capability areas)
`core/http/endpoints/localai/api_instructions.go` defines `instructionDefs` — a lightweight, machine-readable index of capability areas that groups swagger endpoints by tag. It's the primary discovery surface for agents and SDKs ("what can this server do?").
**When to update:** only when adding a new capability area (a new swagger tag). Existing-tag additions automatically surface without any change here.
Add an entry to `instructionDefs`:
```go
{
Name: "my-capability", // URL segment at /api/instructions/my-capability
Description: "Short sentence describing the capability",
Tags: []string{"my-capability"}, // must match swagger @Tags
Intro: "Optional gotcha/context that isn't in the swagger descriptions (caveats, defaults, cross-references to other endpoints).",
},
```
Also bump the expected-length count in `api_instructions_test.go` and add the name to the `ContainElements` assertion.
### 3. `capabilities.js` symbol (for new model-config FLAG_* flags)
If your feature needs a new `FLAG_*` usecase flag in `core/config/model_config.go` (so users can filter gallery models by it, and so `/v1/models` surfaces it), you need to update **all** of:
- `Usecase<Name>` string constant in `core/config/backend_capabilities.go`
- `UsecaseInfoMap` entry mapping the string to its flag + gRPC method
- `FLAG_<NAME>` bitmask in `core/config/model_config.go`
- `GetAllModelConfigUsecases()` map entry (otherwise the YAML loader silently ignores the string)
- `ModalityGroups` membership if the flag should affect `IsMultimodal()` (e.g. realtime_audio is in both speech-input and audio-output groups so a lone flag still reads as multimodal)
- `GuessUsecases()` branch listing the backends that own this capability
- `usecaseFilters` in `core/http/routes/ui_api.go` (drives the gallery filter dropdown)
- `Models.jsx` `FILTERS` array + matching `filters.<camelCase>` i18n key in `core/http/react-ui/public/locales/en/models.json`
- `core/http/react-ui/src/utils/capabilities.js`:
```js
export const CAP_MY_CAPABILITY = 'FLAG_MY_CAPABILITY'
```
React pages that want to filter the ModelSelector by capability import this symbol. Declare it even if you're not building the UI page yet — the declaration keeps the Go/JS vocabularies in sync.
### 4. `docs/content/` (user-facing documentation)
A new capability deserves its own page under `docs/content/features/`, plus cross-links from related features and an entry in `docs/content/whats-new.md`. See the pattern used by `face-recognition.md` / `object-detection.md`.
## Path protection rules
The global auth middleware classifies paths as API paths or non-API paths:
@@ -248,12 +320,36 @@ If you add endpoints under a new top-level path prefix, add it to `isAPIPath()`
When adding a new endpoint:
**Routing & auth**
- [ ] Handler in `core/http/endpoints/`
- [ ] Route registered in appropriate `core/http/routes/` file
- [ ] Auth level chosen: public / standard / admin / feature-gated
- [ ] If feature-gated: constant in `permissions.go`, metadata in `features.go`, middleware in `app.go`
- [ ] Entry added to `RouteFeatureRegistry` in `core/http/auth/features.go` (one row per route/method — all /v1/* routes gate through this, not per-route middleware)
- [ ] If new feature: constant in `permissions.go`, added to the right slice (`APIFeatures` default-ON / `AgentFeatures` default-OFF), metadata in `features.go` `*FeatureMetas()`
- [ ] If feature uses group middleware: wired in `core/http/app.go` and passed to the route registration function
- [ ] If new path prefix: added to `isAPIPath()` in `middleware.go`
- [ ] If OpenAI-compatible: entry in `RouteFeatureRegistry`
- [ ] If token-counting: `usageMiddleware` added to middleware chain
- [ ] Error responses use `schema.ErrorResponse` format
**Advertising surfaces (easy to miss — see the [Advertising surfaces](#advertising-surfaces--where-to-register-a-new-capability) section)**
- [ ] Swagger block on the handler: `@Summary`, `@Tags`, `@Param`, `@Success`, `@Router`
- [ ] If new capability area (new swagger tag): entry in `instructionDefs` in `core/http/endpoints/localai/api_instructions.go` + test count bumped in `api_instructions_test.go`
- [ ] If new `FLAG_*` usecase flag: matching `CAP_*` symbol exported from `core/http/react-ui/src/utils/capabilities.js`
- [ ] `docs/content/features/<feature>.md` created; cross-links from related feature pages; entry in `docs/content/whats-new.md`
**Quality**
- [ ] Error responses use `schema.ErrorResponse` format (or `echo.NewHTTPError` with a mapped gRPC status — see the `mapBackendError` helper in `core/http/endpoints/localai/images.go`)
- [ ] Tests cover both authenticated and unauthenticated access
- [ ] Swagger regenerated (`make swagger`) if you changed any `@Router`/`@Tags`/`@Param` annotation
## Companion: MCP admin tool surface
**Required for admin endpoints.** Every new admin endpoint MUST be considered for the MCP admin tool surface — the REST API and the MCP tool catalog can drift silently otherwise, and both the LocalAI Assistant chat modality and the standalone `local-ai mcp-server` rely on `pkg/mcp/localaitools/` to mirror REST.
Two outcomes are acceptable; one is not:
- **Tool added.** The new endpoint is something an admin would manage conversationally (install, list, edit, toggle, upgrade). Follow the full checklist in [.agents/localai-assistant-mcp.md](localai-assistant-mcp.md): add a `LocalAIClient` interface method, implement it in both `inproc` and `httpapi`, register the tool with a `Tool*` constant, update the skill prompts, **and add the route to `toolToHTTPRoute` in `pkg/mcp/localaitools/coverage_test.go`**.
- **Tool deliberately skipped.** The endpoint is internal/diagnostic and adding a chat path would be misleading. Document the decision in the PR description; no code action.
- **Forgot.** This breaks the contract. The `TestToolHTTPRouteMappingComplete` test in `pkg/mcp/localaitools` is a partial guard (it checks every `Tool*` has a route mapping), but it does NOT detect new REST endpoints without a tool — that's still a process check on the PR author.
**Add to the bottom of the checklist below**:
- [ ] If admin: decided whether MCP coverage is needed; if yes, tool registered + map updated; if no, skip-reason in PR description.

120
.agents/backend-signing.md Normal file
View File

@@ -0,0 +1,120 @@
# Backend image signing & verification
LocalAI verifies backend OCI images against a per-gallery keyless-cosign
policy. This page documents the trust model, the producer side
(`.github/workflows/backend_merge.yml` in this repo), and the consumer
side (`pkg/oci/cosignverify` plus the gallery YAML).
## Trust model
- **Producer:** `.github/workflows/backend_merge.yml` signs each pushed
manifest list with `cosign sign --recursive` in keyless mode after
`docker buildx imagetools create`. The signing cert is issued by
Fulcio bound to the workflow's OIDC identity. There is no long-lived
signing key. `--recursive` signs both the manifest list and every
per-arch entry — needed because our consumer resolves a tag to a
per-arch manifest before checking signatures.
- **Storage:** Signatures are written as OCI 1.1 referrers
(`--registry-referrers-mode=oci-1-1`) in the new Sigstore bundle format
(`--new-bundle-format`). No `:sha256-<hex>.sig` tag clutter.
- **Consumer:** `pkg/oci/cosignverify` discovers the bundle via the
referrers API, hands it to `sigstore-go`, and verifies it against the
policy declared in the gallery YAML (`Gallery.Verification`).
- **Revocation:** Keyless cosign certs are ephemeral (10-minute Fulcio
validity), so revocation is policy-side, not CA-side. The gallery's
`verification.not_before` (RFC3339) is the kill-switch — advance it to
invalidate every signature produced before a known compromise window.
## Producer setup
`backend_merge.yml` is the workflow that joins per-arch digests into the
multi-arch manifest list users actually pull, so it's also the right place
to sign. The job needs:
- `permissions: { id-token: write, contents: read }` at the job level so
the runner can exchange its GitHub OIDC token for a Fulcio cert.
- `sigstore/cosign-installer@v3` step (cosign ≥ 2.2 for
`--new-bundle-format`).
- After each `docker buildx imagetools create`, resolve the resulting
list digest with `docker buildx imagetools inspect <tag> --format
'{{.Manifest.Digest}}'` and sign:
```sh
cosign sign --yes --recursive \
--new-bundle-format \
--registry-referrers-mode=oci-1-1 \
"${REGISTRY_REPO}@${DIGEST}"
```
Sign by digest, never by tag — signing by tag binds the signature to
whatever the tag points at *now*, and a subsequent tag push orphans it.
`backend_build_darwin.yml` builds and pushes single-arch darwin images
that bypass the manifest-list merge. If/when those entries get a gallery
`verification:` policy, the equivalent cosign step has to land there
too.
## Consumer setup (in `mudler/LocalAI` gallery YAML)
Once CI is signing, add a `verification:` block to the backend gallery
entry (`backend/index.yaml`):
```yaml
- name: localai
url: github:mudler/LocalAI/backend/index.yaml@master
verification:
issuer: "https://token.actions.githubusercontent.com"
identity_regex: "^https://github\\.com/mudler/LocalAI/\\.github/workflows/backend_merge\\.yml@refs/heads/master$"
# Optional revocation cutoff; advance during incident response.
# not_before: "2026-06-01T00:00:00Z"
```
Identity matching pins the OIDC subject Fulcio issued the signing cert
to. Without this, any image signed by *anyone* with a Fulcio cert would
pass — the regex is what makes a signature mean "produced by our CI".
## Strict mode
Default behaviour: OCI backends without a `verification:` block install
with a warning (logs include `installing OCI backend without signature
verification`). Tarball/HTTP backends without a `sha256` field log a
similar warning.
For production, set `LOCALAI_REQUIRE_BACKEND_INTEGRITY=1` (or pass
`--require-backend-integrity` to `local-ai run` / `local-ai backends
install` / `local-ai models install`). The warning becomes a hard error
and unverifiable backends refuse to install.
## Revocation playbook
If `backend_merge.yml` (or any workflow with `id-token: write`) is
compromised and we've shipped malicious signed images:
1. **Identify the compromise window.** Find the earliest IntegratedTime
from the bad signatures (Rekor search by `subject` filter).
2. **Set `verification.not_before`** in `backend/index.yaml` to a
timestamp just *after* that window's start.
3. **Push the YAML.** Deployed LocalAI instances pick it up on next
gallery refresh (1-hour cache in `core/gallery/gallery.go`).
4. **Fix the underlying compromise** in the workflow and re-sign images
with the new build, which will have IntegratedTime > `not_before`.
5. **Optional:** for absolute decisiveness, also rotate to a new
workflow path (`backend_merge_v2.yml`) and update `identity_regex`.
## Where the code lives
- `pkg/oci/cosignverify/` — verifier, policy, OCI referrer fetch, NotBefore enforcement.
- `pkg/downloader/uri.go``WithImageVerifier` option threaded through `DownloadFileWithContext`.
- `core/gallery/backends.go``backendDownloadOptions` builds the verifier from the gallery's policy.
- `core/config/gallery.go``Gallery.Verification` YAML schema.
- `core/cli/run.go`, `core/cli/backends.go`, `core/cli/models.go``--require-backend-integrity` flag propagation.
- `.github/workflows/backend_merge.yml` — producer-side `cosign sign --recursive` after each multi-arch manifest list push.
## Out of scope (follow-ups)
- **Signing the gallery YAML itself.** The index is fetched over HTTPS
from GitHub; we trust the host. A cosign blob signature on the YAML
would close that gap but adds key-management overhead. Revisit this
page if/when added.
- **Tarball/HTTP backend signing.** Cosign can sign arbitrary blobs, but
for now non-OCI backends keep using the `sha256:` field in YAML.

View File

@@ -8,8 +8,9 @@ Let's say the user wants to build a particular backend for a given platform. For
- The Makefile has targets like `docker-build-coqui` created with `generate-docker-build-target` at the time of writing. Recently added backends may require a new target.
- At a minimum we need to set the BUILD_TYPE, BASE_IMAGE build-args
- Use .github/workflows/backend.yml as a reference it lists the needed args in the `include` job strategy matrix
- l4t and cublas also requires the CUDA major and minor version
- Use `.github/backend-matrix.yml` as a reference — it's the data-only YAML that lists every backend variant's `build-type`, `base-image`, `platforms`, etc. (`backend.yml` and `backend_pr.yml` consume it via `scripts/changed-backends.js`).
- l4t and cublas also require the CUDA major and minor version.
- For llama-cpp / ik-llama-cpp / turboquant the matrix also sets `builder-base-image` pointing at a prebuilt `quay.io/go-skynet/ci-cache:base-grpc-*` tag. Local `make backends/<name>` defaults to `BUILDER_TARGET=builder-fromsource` and doesn't need it — the Dockerfile's from-source stage installs everything itself.
- You can pretty print a command like `DOCKER_MAKEFLAGS=-j$(nproc --ignore=1) BUILD_TYPE=hipblas BASE_IMAGE=rocm/dev-ubuntu-24.04:7.2.1 make docker-build-coqui`
- Unless the user specifies that they want you to run the command, then just print it because not all agent frontends handle long running jobs well and the output may overflow your context
- The user may say they want to build AMD or ROCM instead of hipblas, or Intel instead of SYCL or NVIDIA insted of l4t or cublas. Ask for confirmation if there is ambiguity.

250
.agents/ci-caching.md Normal file
View File

@@ -0,0 +1,250 @@
# CI Build Caching
Container builds — both the root LocalAI image (`Dockerfile`) and the per-backend images (`backend/Dockerfile.*`) — share a registry-backed BuildKit cache plus a layered set of prebuilt base images. This file explains how the cache is laid out, what invalidates it, and how to bypass it.
## Workflow surfaces
| Workflow | Purpose | Triggers |
|---|---|---|
| `.github/workflows/backend.yml` | Backend container images on master | `push` to master + tags, weekly Sunday cron, `workflow_dispatch` |
| `.github/workflows/backend_pr.yml` | Backend container images on PRs | `pull_request` |
| `.github/workflows/backend_build.yml` | Reusable: builds one backend (one arch) by digest | `workflow_call` from above |
| `.github/workflows/backend_merge.yml` | Reusable: assembles per-arch digests into a multi-arch manifest list | `workflow_call` |
| `.github/workflows/backend_build_darwin.yml` | Reusable: macOS-native backend builds | `workflow_call` |
| `.github/workflows/image.yml` / `image-pr.yml` | Root LocalAI image (push / PR) | push / PR |
| `.github/workflows/image_build.yml` / `image_merge.yml` | Reusable: per-arch root-image build + merge | `workflow_call` |
| `.github/workflows/base-images.yml` | Builds the prebuilt `base-grpc-*` builder bases | Saturdays 05:00 UTC cron, `workflow_dispatch`, master push touching `Dockerfile.base-grpc-builder`, `.docker/install-base-deps.sh`, `.docker/apt-mirror.sh`, or this workflow |
The matrix that drives `backend.yml` / `backend_pr.yml` lives in **`.github/backend-matrix.yml`** (data-only YAML, not embedded in the workflow). `scripts/changed-backends.js` parses it, applies path-filter logic against the PR diff (PR events) or the GitHub Compare API (push events), and emits the filtered matrix plus a `merge-matrix` for backends with multiple per-arch entries.
## Cache layout
- **Cache registry**: `quay.io/go-skynet/ci-cache`
- **One tag per matrix entry per arch**, derived from `tag-suffix` and `platform-tag`:
- Backend builds (`backend_build.yml`): `cache<tag-suffix>-<platform-tag>`
- e.g. `cache-cpu-faster-whisper-amd64`, `cache-cpu-faster-whisper-arm64`, `cache-gpu-nvidia-cuda-13-llama-cpp-amd64`
- Root image builds (`image_build.yml`): `cache-localai<tag-suffix>-<platform-tag>` (with a `-core` placeholder when `tag-suffix` is empty, so `cache-localai-core-amd64` for the core image)
- Pre-built base images (`base-images.yml`): `cache-base-grpc-<variant>` (one per `(BUILD_TYPE, arch)` permutation)
- Each tag stores a multi-arch BuildKit cache manifest (`mode=max`), so every intermediate stage is re-usable, not just the final image.
The per-arch suffix exists because amd64 and arm64 builds produce different intermediate content; sharing one cache key would thrash on every cross-arch rebuild.
## Read/write semantics
| Trigger | `cache-from` | `cache-to` |
|---|---|---|
| `push` to `master` / tag / cron / dispatch | yes | yes (`mode=max,ignore-error=true`) |
| `pull_request` | yes | **no** |
PR builds read master's warm cache but never write — this prevents PRs from polluting the shared cache with their experimental state. After merge, the master build for that matrix entry refreshes the cache.
`ignore-error=true` on the write side means a transient quay push failure does not fail the build; the next master push retries.
## Pre-built base images (`base-grpc-*`)
The C++ backend Dockerfiles (`Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}`) compile gRPC from source. On a cold build that's ~2535 min before any LocalAI source compiles. To skip that on CI, `.github/workflows/base-images.yml` builds and pushes a set of pre-prepped builder bases:
| Tag | Contents |
|---|---|
| `base-grpc-amd64` / `base-grpc-arm64` | Ubuntu 24.04 + apt build deps + protoc + cmake + gRPC at `/opt/grpc` |
| `base-grpc-cuda-12-amd64` | the above + CUDA 12.8 toolkit |
| `base-grpc-cuda-13-amd64` | the above + CUDA 13.0 toolkit (Ubuntu 22.04 base) |
| `base-grpc-cuda-13-arm64` | the above + CUDA 13.0 sbsa toolkit (Ubuntu 24.04 base) |
| `base-grpc-l4t-cuda-12-arm64` | JetPack r36.4.0 base (CUDA preinstalled, `SKIP_DRIVERS=true`) + gRPC |
| `base-grpc-rocm-amd64` | rocm/dev-ubuntu-24.04:7.2.1 base + hipblas/hipblaslt/rocblas + gRPC |
| `base-grpc-vulkan-amd64` / `base-grpc-vulkan-arm64` | Ubuntu 24.04 + Vulkan SDK 1.4.335 + gRPC |
| `base-grpc-intel-amd64` | intel/oneapi-basekit:2025.3.2 base + gRPC |
**Single source of truth**: the install logic for all 10 variants lives in `.docker/install-base-deps.sh`. Both `Dockerfile.base-grpc-builder` AND each variant Dockerfile's `builder-fromsource` stage bind-mount and execute the same script — so the prebuilt CI base and the local from-source path are bit-equivalent by construction.
### How variant Dockerfiles consume the base
`Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}` are multi-target. Three stages plus a final aliasing stage:
- `builder-fromsource``FROM ${BASE_IMAGE}` then runs `install-base-deps.sh` and the per-backend compile script. Used when `BUILDER_TARGET=builder-fromsource` (the default; local `make backends/<name>`).
- `builder-prebuilt``FROM ${BUILDER_BASE_IMAGE}` (one of the prebuilt `base-grpc-*` tags) and runs only the per-backend compile script. Used when `BUILDER_TARGET=builder-prebuilt` (CI when the matrix entry sets `builder-base-image`).
- `FROM ${BUILDER_TARGET} AS builder` — alias resolves the ARG-selected stage to a fixed name (BuildKit doesn't allow ARG expansion in `COPY --from=`).
- `FROM scratch` + `COPY --from=builder ...package/. ./` — emits the final scratch image with just the package contents.
BuildKit prunes the unreferenced builder stage, so each build only runs the path it needs. `backend_build.yml` derives `BUILDER_TARGET=builder-prebuilt` automatically when the matrix entry has a non-empty `builder-base-image`; otherwise it defaults to `builder-fromsource`.
The matrix `(build-type, platforms)``builder-base-image` mapping for llama-cpp / ik-llama-cpp / turboquant entries:
| `build-type` | `platforms` | tag |
|---|---|---|
| `''` | `linux/amd64` | `base-grpc-amd64` |
| `''` | `linux/arm64` | `base-grpc-arm64` |
| `cublas` cuda 12 | `linux/amd64` | `base-grpc-cuda-12-amd64` |
| `cublas` cuda 13 | `linux/amd64` | `base-grpc-cuda-13-amd64` |
| `cublas` cuda 13 | `linux/arm64` | `base-grpc-cuda-13-arm64` |
| `cublas` cuda 12 + JetPack base | `linux/arm64` | `base-grpc-l4t-cuda-12-arm64` |
| `hipblas` | `linux/amd64` | `base-grpc-rocm-amd64` |
| `vulkan` | `linux/amd64` | `base-grpc-vulkan-amd64` |
| `vulkan` | `linux/arm64` | `base-grpc-vulkan-arm64` |
| `sycl_*` | `linux/amd64` | `base-grpc-intel-amd64` |
### Bootstrap order when adding a new variant
If you add a new entry to `base-images.yml`'s matrix, the new tag does not exist on quay until the workflow runs. To consume it from a variant entry safely, dispatch the base-images workflow on the branch first:
```bash
gh workflow run base-images.yml --ref <feature-branch>
```
Wait for the new variant to push, then merge the consumer change. Otherwise the consumer's CI fails with "image not found."
## Per-arch native builds + manifest merge
Multi-arch backends (and the core LocalAI image) build natively per arch instead of running both arches under QEMU emulation on a single x86 runner. The pattern:
- The matrix has TWO entries per multi-arch backend, sharing the same `tag-suffix` but distinct `platforms` + `platform-tag` + `runs-on`. Example: `-cpu-faster-whisper` has one amd64 entry on `ubuntu-latest` and one arm64 entry on `ubuntu-24.04-arm`.
- Each per-arch build pushes by **canonical digest only** (no tags) via `outputs: type=image,push-by-digest=true,name-canonical=true,push=true`. The digest is uploaded as an artifact named `digests<tag-suffix>-<platform-tag>` (or `digests-localai<...>` for root-image builds).
- `scripts/changed-backends.js` detects shared `tag-suffix` and emits a `merge-matrix` output. `backend.yml` / `backend_pr.yml` have a `backend-merge-jobs` job that consumes it and calls `backend_merge.yml`.
- `backend_merge.yml` downloads all matching digest artifacts and runs `docker buildx imagetools create` to publish the final tagged manifest list pointing at both per-arch digests. Same `docker/metadata-action` config as the original monolithic build, so consumers see no tag-shape change.
- `image_merge.yml` is the equivalent for the root LocalAI image (`-core` placeholder when `tag-suffix` is empty so the artifact-name glob doesn't over-match across `core` and `gpu-vulkan`).
**`provenance: false` is required on multi-registry digest pushes**: with the default `mode=max` provenance attestation, BuildKit bundles a per-registry attestation manifest into each registry's manifest list, making the resulting list digest diverge across registries. `steps.build.outputs.digest` only matches one of them and the merge step's `imagetools create <reg>@sha256:<digest>` lookup fails on the other. Setting `provenance: false` keeps the digest content-only and identical across registries.
## Path filter on master push
Both `backend.yml` (push) and `backend_pr.yml` (PR) generate their matrix dynamically through `scripts/changed-backends.js`:
- **PR events**: paginated `pulls/{n}/files` API → filter the matrix to entries whose `dockerfile` path prefix matches the PR diff.
- **Push events**: GitHub Compare API (`/repos/{owner}/{repo}/compare/{before}...{after}`) → same path-filter logic. Falls back to "run everything" on first-branch push (`event.before` zero), API truncation (≥300 changed files), missing API token, or any thrown error.
- **Tag pushes**: `FORCE_ALL=true` is set from the workflow side (`startsWith(github.ref, 'refs/tags/')`) — releases rebuild every backend regardless of diff.
- **Schedule / `workflow_dispatch`**: no `event.before`, falls through to "run everything" automatically.
The Sunday 06:00 UTC cron on `backend.yml` exists specifically because path filtering can leave Python backends frozen on stale wheels. `DEPS_REFRESH` (below) only fires when the build actually runs, so an untouched Python backend would never re-resolve its unpinned deps. The weekly cron is the safety net.
## The `DEPS_REFRESH` cache-buster (Python backends)
Every Python backend goes through the shared `backend/Dockerfile.python`, which ends with:
```dockerfile
ARG DEPS_REFRESH=initial
RUN cd /${BACKEND} && PORTABLE_PYTHON=true make
```
Most Python backends ship `requirements*.txt` files that **do not pin every transitive dep** (`torch`, `transformers`, `vllm`, `diffusers`, etc. are listed without a `==` pin, or with `>=` lower bounds only). With a warm BuildKit cache, the `make` layer hashes only on Dockerfile instructions + COPYed source — not on what `pip install` resolves at runtime. So a warm cache would ship the *first* version of `vllm` ever cached and never pick up upstream releases.
`DEPS_REFRESH` defends against that:
- `backend_build.yml` computes `date -u +%Y-W%V` (ISO week, e.g. `2026-W19`) before each build and passes it as a build-arg.
- The `RUN ... make` layer's BuildKit hash now includes that string, so the layer invalidates **at most once per week**, automatically picking up newer wheels.
- Within a week, builds stay warm.
This applies only to `Dockerfile.python` because:
- Go (`Dockerfile.golang`) pins versions in `go.mod` / `go.sum`.
- Rust (`Dockerfile.rust`) pins via `Cargo.lock`.
- C++ backends pin gRPC (`v1.65.0`) and llama.cpp at a specific commit; their inputs don't drift between rebuilds.
### Adjusting the cadence
Bump the format to daily (`+%Y-%m-%d`) or hourly (`+%Y-%m-%d-%H`) for faster refreshes. For one-shot rebuilds without changing the schedule, append a marker to the tag-suffix in the matrix or temporarily delete that backend's cache tag in quay.
## ccache for C++ backend builds
`Dockerfile.{llama-cpp,ik-llama-cpp,turboquant}` declare a BuildKit cache mount on `/root/.ccache`:
```dockerfile
RUN --mount=type=cache,target=/root/.ccache,id=<backend>-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
bash /usr/local/sbin/compile.sh
```
The compile script exports `CMAKE_C/CXX/CUDA_COMPILER_LAUNCHER=ccache` so CMake threads ccache through gcc/g++/nvcc. `cache-to: type=registry,mode=max` exports the cache mount data into the registry cache, so subsequent builds restore it.
On a `LLAMA_VERSION` bump, most translation units are byte-identical to the previous version's preprocessed source — ccache returns the previous `.o` and skips the real compile. Same for LocalAI source changes that don't actually touch llama.cpp's CMake inputs. Cache scope is per `(TARGETARCH, BUILD_TYPE)` so e.g. cublas-12 doesn't share with cublas-13 (their CUDA headers differ; cross-pollination would just be cache misses anyway).
## Composite actions
Two composite actions handle runner-side prep:
- **`.github/actions/free-disk-space/action.yml`** — wraps `jlumbroso/free-disk-space@main` plus an explicit apt purge of dotnet/android/ghc/mono/etc. Reclaims ~610 GB on `ubuntu-latest`. No-op on self-hosted runners. Used by `backend_build.yml`, `image_build.yml`, `test.yml`, `tests-aio.yml`, etc.
- **`.github/actions/setup-build-disk/action.yml`** — relocates Docker's data-root to `/mnt` on hosted X64 runners. GHA hosted `ubuntu-latest` ships ~75 GB of unused space at `/mnt`; combined with the free-disk-space cleanup this gives ~100 GB working space — enough for ROCm dev image + vLLM torch install + flash-attn intermediate layers. No-op on self-hosted and on non-X64 hosted runners. Used by `backend_build.yml`, `image_build.yml`, `base-images.yml`.
Both actions run before any docker buildx step.
## Concurrency
All `backend.yml` / `image.yml` / `test.yml` / etc. workflows use:
```yaml
concurrency:
group: ci-<workflow>-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
cancel-in-progress: ${{ github.event_name == 'pull_request' }}
```
- **PR events** group by PR number → newer pushes to the same PR cancel old runs (intended).
- **Push events** group by `github.sha` → each master commit gets its own run; rapid-fire merges don't cancel each other (this was a real issue prior — two master pushes 11 seconds apart would cancel the first's CI).
## Self-warming, no separate populator
There is no cron job that pre-warms the BuildKit cache for individual backends. The production builds *are* the populators. The first master build of a given matrix entry pays the cold cost; subsequent same-entry master builds reuse everything that hasn't changed (apt installs, gRPC compile in the variant `builder-fromsource` stage or skipped entirely when consuming `base-grpc-*`, Python wheel installs, etc.). The base-images workflow's weekly cron is the closest thing to a populator and only refreshes the prebuilt builder bases.
## Manually evicting cache
To force a fully cold build for one backend or the whole image:
```bash
# Delete a single tag (requires quay credentials with admin on the repo)
curl -X DELETE \
-H "Authorization: Bearer ${QUAY_TOKEN}" \
https://quay.io/api/v1/repository/go-skynet/ci-cache/tag/cache-gpu-nvidia-cuda-12-vllm-amd64
# List all tags
curl -s -H "Authorization: Bearer ${QUAY_TOKEN}" \
"https://quay.io/api/v1/repository/go-skynet/ci-cache/tag/?limit=100" | jq '.tags[].name'
```
Eviction is rarely needed in normal operation — `DEPS_REFRESH` handles weekly drift, source changes invalidate naturally, and `mode=max` keeps the cache scoped per matrix entry per arch so a stale tag never bleeds into a different build.
## What the cache does **not** cover
- The `free-disk-space` and `setup-build-disk` composite actions run on every job — these reclaim runner-state, not Docker layers, so BuildKit caches don't apply.
- Intermediate artifacts of `Build (PR)` are not pushed anywhere — PRs only build for verification.
- Darwin builds (see below) — macOS runners have no Docker daemon, so the registry-backed BuildKit cache cannot apply.
## Darwin native caches
`backend_build_darwin.yml` runs natively on `macOS-14` GitHub-hosted runners — there is no Docker, no BuildKit, no cross-job registry cache. Instead, the reusable workflow uses `actions/cache@v4` for four native caches that mirror the spirit of the Linux cache (warm by default, weekly refresh for unpinned Python deps, PRs read-only).
| Cache | Path(s) | Key | Scope |
|---|---|---|---|
| Go modules + build | `~/go/pkg/mod`, `~/Library/Caches/go-build` | `go.sum` (managed by `actions/setup-go@v5` `cache: true`) | All darwin jobs |
| Homebrew | `~/Library/Caches/Homebrew/downloads`, selected `/opt/homebrew/Cellar/*` | hash of `backend_build_darwin.yml` | All darwin jobs |
| ccache (llama.cpp CMake) | `~/Library/Caches/ccache` | pinned `LLAMA_VERSION` from `backend/cpp/llama-cpp/Makefile` | `inputs.backend == 'llama-cpp'` only |
| Python wheels (uv + pip) | `~/Library/Caches/pip`, `~/Library/Caches/uv` | `inputs.backend` + ISO week (`+%Y-W%V`) + hash of that backend's `requirements*.txt` | `inputs.lang == 'python'` only |
Read/write semantics match the BuildKit cache: `actions/cache/restore` runs every time, `actions/cache/save` is gated on `github.event_name != 'pull_request'`. PRs read master's warm cache but never write back.
The Python wheel cache uses the same ISO-week cache-buster as the Linux `DEPS_REFRESH` build-arg — same problem (unpinned `torch`/`mlx`/`diffusers`/`transformers` resolve to fresh wheels weekly), same ~one-cold-rebuild-per-week solution.
The brew Cellar cache requires `HOMEBREW_NO_AUTO_UPDATE=1` and `HOMEBREW_NO_INSTALL_CLEANUP=1` (set as job-level env). Without those, `brew install` would mutate the very directories that were just restored, defeating the cache.
**Force-link after cache restore**: `actions/cache` restores `/opt/homebrew/Cellar/*` but NOT the `/opt/homebrew/bin/*` symlinks. After a cache hit, `brew install` sees the Cellar entries and decides "already installed" without re-running its link step, leaving the formulas off PATH. The Dependencies step explicitly runs `brew link --overwrite` for every cached formula afterwards to ensure the symlinks exist.
For ccache, the workflow exports `CMAKE_ARGS=… -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache` via `$GITHUB_ENV` before running `make build-darwin-go-backend`. The Makefile in `backend/cpp/llama-cpp/` already forwards `CMAKE_ARGS` through to each variant build (`fallback`, `grpc`, `rpc-server`), so no script changes are needed. The three variants share most TUs, so ccache dedupes object files across them.
`backend_build_darwin.yml` also has a llama-cpp-specific build-step branch that runs `make backends/llama-cpp-darwin` (the bespoke script that compiles three CMake variants and bundles dylibs via `otool`), distinct from the generic `make build-darwin-${lang}-backend` path. This was consolidated from a previously-bespoke top-level `llama-cpp-darwin` job in `backend.yml` so llama-cpp on Darwin honors the same path filter as the other 34 Darwin backends.
### Cache budget on Darwin
GitHub Actions caches are limited to 10 GB per repo. Steady-state worst case: ~800 MB Go cache + ~2 GB brew Cellar + up to 2 GB ccache + ~1.5 GB × 5 python backends. If the cap is hit, prefer collapsing the per-backend Python keys into a shared `pyenv-darwin-shared-<week>` key (accepts more cross-backend churn for a smaller footprint) before reducing other caches.
## Self-hosted runners
`.github/backend-matrix.yml` has zero references to `arc-runner-set` or `bigger-runner` — all backends run on GHA free-tier hosted runners (`ubuntu-latest` for amd64, `ubuntu-24.04-arm` for arm64 native, `macos-14` for Darwin). The migration off self-hosted relied on the per-arch native split (no QEMU emulation) plus `setup-build-disk`'s `/mnt` relocation (~100 GB working space, enough for ROCm dev image + vLLM/torch installs).
One residual self-hosted reference remains in `test-extra.yml` (`tests-vibevoice-cpp-grpc-transcription` uses `bigger-runner` for the 30s JFK-decode timeout headroom). That's a separate concern.
## Touching the cache pipeline
When changing `image_build.yml`, `backend_build.yml`, any of the `backend/Dockerfile.*` files, `Dockerfile.base-grpc-builder`, `.docker/install-base-deps.sh`, `.docker/<backend>-compile.sh`, or `scripts/changed-backends.js`:
1. **Don't drop `DEPS_REFRESH=...` from the build-args** without a replacement strategy (lockfiles, pinned requirements). Otherwise master will silently freeze on whichever versions were cached at the time.
2. **Keep `(tag-suffix, platform-tag)` unique per matrix entry** — together they're the cache namespace. Two matrix entries sharing a key would clobber each other's cache.
3. **Keep `cache-to` gated on `github.event_name != 'pull_request'`** — PRs must not write.
4. **Keep `ignore-error=true` on `cache-to`** — quay registry hiccups must not fail builds.
5. **Keep `provenance: false` on push-by-digest steps** — multi-registry digest divergence is the Bug We Already Fixed; reintroducing provenance attestation re-breaks the merge.
6. **`install-base-deps.sh` is the single source of truth for base contents.** Both `Dockerfile.base-grpc-builder` (CI) and the variant Dockerfiles' `builder-fromsource` (local) bind-mount and execute it. If you add a package to one path, add it to the script — don't fork the logic into a Dockerfile RUN.
7. **After adding a `base-images.yml` matrix variant, run the workflow on your branch before merging consumer changes** that depend on the new tag — otherwise the consumer's CI fails "image not found."

View File

@@ -42,6 +42,14 @@ trim_trailing_whitespace = false
Use `github.com/mudler/xlog` for logging which has the same API as slog.
## Go tests
All Go tests — including backend tests — must use [Ginkgo](https://onsi.github.io/ginkgo/) (v2) with Gomega matchers, not the stdlib `testing` package with `t.Run` / `t.Errorf`. A test file should register a suite with `RegisterFailHandler(Fail)` in a `TestXxx(t *testing.T)` bootstrap and use `Describe`/`Context`/`It` blocks for the actual cases. Look at any existing `*_test.go` under `core/` or `pkg/` for a template.
Do not mix styles within a package. If you are extending tests in a package that already uses Ginkgo, keep using Ginkgo. If you find stdlib-style Go tests in the tree, treat them as tech debt to be migrated rather than as a pattern to follow.
This is enforced by `golangci-lint` via the `forbidigo` linter (see `.golangci.yml`); calls like `t.Errorf` / `t.Fatalf` / `t.Run` / `t.Skip` / `t.Logf` are flagged. Run `make lint` locally before submitting; the same check runs in CI (`.github/workflows/lint.yml`).
## Documentation
The project documentation is located in `docs/content`. When adding new features or changing existing functionality, it is crucial to update the documentation to reflect these changes. This helps users understand how to use the new capabilities and ensures the documentation stays relevant.

84
.agents/ds4-backend.md Normal file
View File

@@ -0,0 +1,84 @@
# Working on the ds4 Backend
`antirez/ds4` is a single-model inference engine for DeepSeek V4 Flash.
LocalAI wraps the engine's C API (`ds4/ds4.h`) with a fresh C++ gRPC server at
`backend/cpp/ds4/` - NOT a fork of llama-cpp's grpc-server.cpp.
## Pin
`backend/cpp/ds4/Makefile` pins `DS4_VERSION?=<sha>` at the top. The `ds4`
target in the Makefile clones `antirez/ds4` at that commit (mirroring the
llama-cpp / ik-llama-cpp / turboquant pattern). The bump-deps bot
(`.github/workflows/bump_deps.yaml`) finds this pin via grep and opens a
daily PR to update it. To bump manually: edit the `DS4_VERSION?=` line,
then `make purge && make` (or rely on CI's clean build).
## Wire shape
| RPC | Implementation |
|---|---|
| Health, Free, Status | Trivial; no engine dependency for Health |
| LoadModel | `ds4_engine_open` + `ds4_session_create`; backend is compile-time (DS4_NO_GPU → CPU, __APPLE__ → Metal, otherwise CUDA) |
| TokenizeString | `ds4_tokenize_text` |
| Predict | `ds4_engine_generate_argmax` + `DsmlParser` → one ChatDelta with content / reasoning_content / tool_calls[] |
| PredictStream | Same, per-token ChatDelta writes |
## DSML
ds4 emits tool calls as literal text markers (`<DSMLtool_calls>` etc.) -
NOT special tokens. `dsml_parser.{h,cpp}` is our streaming state machine that
classifies token bytes into CONTENT / REASONING / TOOL_START / TOOL_ARGS / TOOL_END
events. `dsml_renderer.{h,cpp}` does the prompt direction: turns
OpenAI tool_calls + role=tool messages back into DSML for the next turn.
## Thinking modes
`PredictOptions.Metadata["enable_thinking"]` gates thinking on/off (default ON).
`["reasoning_effort"] == "max" | "xhigh"` selects `DS4_THINK_MAX`; anything else
maps to `DS4_THINK_HIGH`. We pass the chosen mode to `ds4_chat_append_assistant_prefix`.
## Disk KV cache
`kv_cache.{h,cpp}` implements an SHA1-keyed file cache using ds4's public
`ds4_session_save_payload` / `ds4_session_load_payload` API. Enable per request
via `ModelOptions.Options[] = "kv_cache_dir:/some/path"`. Format is **our own** -
NOT bit-compatible with ds4-server's KVC files (interop is a follow-up plan).
## Build matrix
| Build | Where | Notes |
|---|---|---|
| `cpu-ds4` (amd64 + arm64) | Linux GHA | ds4 considers CPU debug-only; useful only for wiring tests |
| `cuda13-ds4` (amd64 + arm64) | Linux GHA + DGX Spark validation | Primary production path on Linux |
| `ds4-darwin` (arm64) | macOS GHA runners | Metal; uses `scripts/build/ds4-darwin.sh` like llama-cpp-darwin |
cuda12 is intentionally omitted. ROCm / Vulkan / SYCL are not applicable.
## Hardware-gated validation
`tests/e2e-backends/backend_test.go` in `BACKEND_BINARY` mode:
```
BACKEND_BINARY=$(pwd)/backend/cpp/ds4/package/run.sh \
BACKEND_TEST_MODEL_FILE=/path/to/ds4flash.gguf \
BACKEND_TEST_CAPS=health,load,predict,stream,tools \
BACKEND_TEST_TOOL_PROMPT="What's the weather in Paris?" \
go test -count=1 -timeout=30m -v ./tests/e2e-backends/...
```
CI does not load the model; the suite is opt-in via env vars.
## Importer
`core/gallery/importers/ds4.go` (`DS4Importer`) auto-detects ds4 weights by
matching the `antirez/deepseek-v4-gguf` repo URI or the
`DeepSeek-V4-Flash-*.gguf` filename pattern. **Registered BEFORE
`LlamaCPPImporter`** in `defaultImporters` - both match `.gguf` but ds4 is more
specific, and first-match-wins. The importer emits `backend: ds4`, uses
`ds4flash.gguf` as the local filename (matches ds4's own CLI default), and
disables the Go-side automatic tool-parsing fallback (the C++ backend emits
ChatDelta.tool_calls natively via `DsmlParser`).
ds4 is also listed in `core/http/endpoints/localai/backend.go`'s pref-only
slice so the `/import-model` UI surfaces it as a manual choice for users who
want to force the backend on a non-canonical URI.

View File

@@ -61,6 +61,12 @@ Always check `llama.cpp` for new model configuration options that should be supp
- `reasoning_format` - Reasoning format options
- Any new flags or parameters
### Speculative Decoding Types
The `spec_type` option in `grpc-server.cpp` delegates to upstream's `common_speculative_types_from_names()`, so new speculative types added to the `common_speculative_type_from_name` map in `common/speculative.cpp` are picked up automatically with no code changes - only docs need an entry in `docs/content/advanced/model-configuration.md`. Current values: `none`, `draft-simple`, `draft-eagle3`, `draft-mtp`, `ngram-simple`, `ngram-map-k`, `ngram-map-k4v`, `ngram-mod`, `ngram-cache`.
`draft-mtp` (Multi-Token Prediction, [ggml-org/llama.cpp#22673](https://github.com/ggml-org/llama.cpp/pull/22673)) does not need a separate draft GGUF: when `spec_type` includes `draft-mtp` and `draftmodel` is empty, the upstream server creates an MTP context off the target model itself. LocalAI's gRPC layer needs no changes for this — it works through the existing `params.speculative.types` plumbing and the derived `cparams.n_rs_seq = params.speculative.need_n_rs_seq()` in `common_context_params_to_llama`.
### Implementation Guidelines
1. **Feature Parity**: Always aim for feature parity with llama.cpp's implementation

View File

@@ -0,0 +1,97 @@
# LocalAI Assistant — admin MCP server
This document is the contract for **anyone** (human or AI agent) touching LocalAI's admin REST surface, the in-process MCP server that wraps it, or the embedded skill prompts that teach the assistant how to use it. Read this before adding/removing/renaming admin endpoints, MCP tools, or skill recipes.
## What this feature is
`pkg/mcp/localaitools/` is a public Go package that exposes LocalAI's admin/management surface as an MCP server. It is used in two ways:
1. **In-process**: when an admin opens a chat with `metadata.localai_assistant=true`, the chat handler injects the in-memory MCP server (paired `net.Pipe()` transport, no HTTP loopback) so the LLM can install models, manage backends and edit configs by chatting.
2. **Standalone**: the `local-ai mcp-server --target=…` subcommand serves the same MCP server over stdio, talking HTTP to a remote LocalAI instance.
The two modes share **all** tool definitions and skill prompts. They differ only in their `LocalAIClient` implementation (`inproc/` calls services directly; `httpapi/` calls REST).
## The three things you must keep in sync
When you change LocalAI's admin surface, three layers must stay aligned:
1. **REST endpoint** in `core/http/endpoints/localai/*.go`.
2. **MCP tool registration** in `pkg/mcp/localaitools/tools_*.go`, plus a method on `LocalAIClient` (in `client.go`) and implementations in both `inproc/client.go` **and** `httpapi/client.go`.
3. **Skill prompt** under `pkg/mcp/localaitools/prompts/skills/*.md` — the markdown that teaches the LLM how to use the new tool. If the new tool fits an existing recipe, update that recipe; otherwise add a new file.
If you ship a REST endpoint without (2) and (3), conversational admins won't see the feature.
## Checklist for adding a new admin endpoint
- [ ] REST endpoint exists in `core/http/endpoints/localai/*.go` and is gated by `auth.RequireAdmin()` in `core/http/routes/localai.go`.
- [ ] `LocalAIClient` interface in `pkg/mcp/localaitools/client.go` has a method covering the new operation.
- [ ] DTOs added/updated in `pkg/mcp/localaitools/dto.go` (JSON-tagged; never expose raw service types).
- [ ] `inproc/client.go` implements the new method by calling the service directly (not via HTTP loopback).
- [ ] `httpapi/client.go` implements the new method by calling the REST endpoint.
- [ ] Tool registration added in the appropriate `pkg/mcp/localaitools/tools_*.go`. Mutating tools must reference safety rule 1 in the description.
- [ ] If the tool is mutating, ensure `Options{DisableMutating: true}` skips it (mirror the pattern in `tools_models.go`).
- [ ] Skill prompt added or updated under `pkg/mcp/localaitools/prompts/skills/`. The prompt must instruct the LLM when to call the tool, what to ask the user first, and what to do on error.
- [ ] Tests:
- `pkg/mcp/localaitools/server_test.go` adds the tool name to `expectedFullCatalog` and `expectedReadOnlyCatalog` (if read-only).
- Tool dispatch is added to `TestEachToolDispatchesToClient`.
- `pkg/mcp/localaitools/httpapi/client_test.go` covers the new HTTP path.
## Adding a new skill recipe (no new tool)
Sometimes you want to teach the LLM a new pattern that uses existing tools. Drop a markdown file under `pkg/mcp/localaitools/prompts/skills/<verb>_<noun>.md`. The file is automatically embedded by `//go:embed` and assembled into the system prompt in lexicographic order. No Go changes needed.
Conventions:
- Filename: `<verb>_<noun>.md` (e.g. `install_chat_model.md`, `upgrade_backend.md`).
- First line: `# Skill: <Title Case description>`.
- Number the steps. Reference exact tool names in backticks.
- If the skill mutates state, remind the LLM to confirm with the user.
## Code conventions
These rules guard against the magic-literal drift that surfaced in the first audit. Do not re-introduce bare strings.
- **Tool names** always come from the `Tool*` constants in `pkg/mcp/localaitools/tools.go`. Tool registrations, the test catalog (`server_test.go`'s `expectedFullCatalog` / `expectedReadOnlyCatalog`), and dispatch tables reference the constants. The embedded skill prompts under `prompts/` keep bare strings — that's the one allowed exception, and `TestPromptsContainSafetyAnchors` enforces alignment.
- **Toggle/pin actions** use the `modeladmin.Action` type (`pkg/mcp/localaitools` and `core/services/modeladmin`). Use `ActionEnable`/`ActionDisable`/`ActionPin`/`ActionUnpin`; never bare `"enable"`/`"pin"` strings.
- **Capability tags** for `list_installed_models` use the `localaitools.Capability` type (`capability.go`). The `LocalAIClient.ListInstalledModels` interface takes a typed `Capability`, and the `inproc` switch only accepts canonical values (`"embed"`/`"embedding"` are not aliases — only `CapabilityEmbeddings`).
- **HTTP error checks** in `httpapi.Client` use `errors.Is(err, ErrHTTPNotFound)`, not substring matches on `err.Error()`. The typed `*HTTPError` carries `StatusCode` and `Body`; add new sentinel errors as needed rather than re-introducing string matching.
- **Channel sends** to `GalleryService.ModelGalleryChannel` / `BackendGalleryChannel` from inproc clients MUST select on `ctx.Done()` so a cancelled chat completion releases the goroutine. See `inproc.sendModelOp` / `sendBackendOp`.
- **Disk writes** of model config YAML go through `modeladmin.writeFileAtomic` (temp file + `os.Rename`). `os.WriteFile` truncates on crash and corrupts the model.
- **MCP server lifecycle**: every initialised holder MUST register `Close()` with `signals.RegisterGracefulTerminationHandler`. The standalone `mcp-server` CLI uses `signal.NotifyContext` to honour SIGINT/SIGTERM.
## File map (where to look)
```
pkg/mcp/localaitools/
client.go # LocalAIClient interface + DTO registry
dto.go # JSON-tagged DTOs shared by both client impls
server.go # NewServer(client, opts) — registers tools
tools.go # Tool* name constants (single source of truth)
capability.go # Capability type + constants
tools_models.go # gallery_search, install_model, import_model_uri, ...
tools_backends.go
tools_config.go
tools_system.go
tools_state.go
prompts.go # //go:embed loader + SystemPrompt(opts)
prompts/00_role.md
prompts/10_safety.md # SAFETY RULES — change with care
prompts/20_tools.md # curated tool catalog with one-liners
prompts/skills/*.md
inproc/client.go # in-process LocalAIClient (services-direct)
httpapi/client.go # REST LocalAIClient (for standalone CLI / remote)
core/http/endpoints/mcp/
localai_assistant.go # process-wide holder + LocalToolExecutor
core/cli/mcp_server.go # local-ai mcp-server subcommand
```
## Why two clients
The in-process MCP server runs inside the same LocalAI binary that serves chat. Going over HTTP loopback would (a) require minting a synthetic admin API key for the server to authenticate against itself, (b) double-marshal every tool dispatch, and (c) lose access to in-process channels (e.g. `GalleryService.ModelGalleryChannel` for streaming install progress). So in-process uses `inproc.Client`. The standalone stdio CLI talks to a *remote* LocalAI; HTTP is the only option, so it uses `httpapi.Client`. Both implement the same `LocalAIClient` interface, and the parity test in `pkg/mcp/localaitools/parity_test.go` (when present) keeps their output equivalent.
## Why prompt-enforced confirmation, not code gates
The user chose KISS. Every mutating tool has a safety rule (`prompts/10_safety.md` rule 1) that requires the LLM to summarise the action and wait for explicit user confirmation before calling it. There is no `plan_*`/`apply_*` two-step in code. If you add a mutating tool, do **not** add per-tool confirmation logic in Go — instead, list the new tool name in `prompts/10_safety.md` so the LLM knows it falls under the confirmation rule.
## Distributed mode
The in-memory MCP server runs only on the head node (where the chat handler runs). `inproc.Client` wraps services that are already distributed-aware (`GalleryService` coordinates with workers; `ListNodes` reads the NATS-populated registry). No NATS routing of MCP tools — the admin surface lives on the head, period.

62
.agents/sglang-backend.md Normal file
View File

@@ -0,0 +1,62 @@
# Working on the SGLang Backend
The SGLang backend lives at `backend/python/sglang/backend.py` (async gRPC). It wraps SGLang's `Engine` (`sglang.srt.entrypoints.engine.Engine`) and translates LocalAI's gRPC `PredictOptions` into SGLang sampling params + outputs into `Reply.chat_deltas`. Structurally it mirrors `backend/python/vllm/backend.py` — keep them shaped the same so changes in one have an obvious analog in the other.
## `engine_args` is the universal escape hatch
A small fixed set of fields on `ModelOptions` is mapped to typed SGLang kwargs in `LoadModel` (model, quantization, load_format, gpu_memory_utilization → mem_fraction_static, trust_remote_code, enforce_eager → disable_cuda_graph, tensor_parallel_size → tp_size, max_model_len → context_length, dtype). **Everything else** flows through the `engine_args:` YAML map.
Validation happens in `_apply_engine_args`. Keys are checked against `dataclasses.fields(ServerArgs)` (`sglang.srt.server_args.ServerArgs` is a flat `@dataclass` with ~380 fields). Unknown keys raise `ValueError` at LoadModel time with a `difflib.get_close_matches` suggestion — same shape as the vLLM backend.
**Precedence:** typed `ModelOptions` fields populate `engine_kwargs` first, then `engine_args` overrides them. So a YAML that sets both `gpu_memory_utilization: 0.9` and `engine_args.mem_fraction_static: 0.5` ends up at `0.5`. Document this when answering "why didn't my YAML field stick?".
**ServerArgs is flat.** Unlike vLLM, where speculative decoding is nested under `engine_args.speculative_config: {...}`, SGLang exposes flat top-level fields: `speculative_algorithm`, `speculative_draft_model_path`, `speculative_num_steps`, `speculative_eagle_topk`, `speculative_num_draft_tokens`, `speculative_dflash_block_size`, etc. There is no `speculative_config:` dict. Same goes for compilation, kv-transfer, attention — all flat.
The canonical reference is `python/sglang/srt/server_args.py:ServerArgs` (line ~304). When SGLang adds new flags, no LocalAI code change is needed — they're automatically available via `engine_args:`. The validator picks them up because it introspects the live dataclass.
## Speculative decoding cheatsheet
`--speculative-algorithm` accepts `EAGLE`, `EAGLE3`, `NEXTN`, `STANDALONE`, `NGRAM`, `DFLASH`. `NEXTN` is silently rewritten to `EAGLE` in `ServerArgs.__post_init__` (`server_args.py:3286-3287`). MTP (Multi-Token Prediction) is the same EAGLE path with `num_steps=1, eagle_topk=1, num_draft_tokens=2` against a target whose architecture has multi-token heads (e.g. MiMo-7B-RL, DeepSeek-V3-MTP).
| Algorithm | Drafter requirement | Gallery demo target | Gallery demo drafter |
|-----------|--------------------|---------------------|----------------------|
| `NEXTN` / `EAGLE` (MTP) | Assistant drafter or built-in heads | google/gemma-4-E2B-it, google/gemma-4-E4B-it | google/gemma-4-E2B-it-assistant, google/gemma-4-E4B-it-assistant |
| `EAGLE3` | EAGLE3 draft head | (no gallery entry yet) | e.g. jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B |
| `DFLASH` | Block-diffusion drafter | (no gallery entry yet) | e.g. z-lab/Qwen3-4B-DFlash-b16 |
| `STANDALONE` | Smaller LLM as drafter | (no gallery entry yet) | any smaller chat-tuned LLM in the same family |
| `NGRAM` | None — uses prefix history | (no gallery entry yet) | n/a |
The Gemma 4 demos use `mem_fraction_static: 0.85` (cookbook default) and the cookbook's `num_steps=5, num_draft_tokens=6, eagle_topk=1` parameters. Other algorithms are reachable from any user YAML via `engine_args:` but don't have shipped demos yet — that's a deliberate gallery scope choice, not a backend limitation.
Gemma 4 support requires sglang built from a commit that includes [PR #21952](https://github.com/sgl-project/sglang/pull/21952). LocalAI's pinned release for cublas12 / cublas13 includes it. The `l4t13` (JetPack 7 / sbsa cu130) build floors at `sglang>=0.5.0` because the `pypi.jetson-ai-lab.io` mirror still ships only `0.5.1.post2` as of 2026-05-06 — Gemma 4 / MTP recipes are therefore not available on l4t13 until that mirror catches up. `backend.py` keeps backward compat with the 0.5.x → 0.5.11 `SamplingParams.seed``sampling_seed` rename via runtime detection.
Compatibility caveats per the SGLang docs: DFLASH and NGRAM are incompatible with `enable_dp_attention`; DFLASH requires `pp_size == 1`; STANDALONE is incompatible with `enable_dp_attention`; NGRAM is CUDA-only and disables the overlap scheduler.
### `mem_fraction_static` + quantization + MTP on consumer GPUs
When combining online weight quantization (`engine_args.quantization: fp8` / `awq` / etc.) with built-in-head MTP (`speculative_algorithm: EAGLE`/`NEXTN`) on a tight VRAM budget, sglang's default `mem_fraction_static: 0.85` will OOM during draft-worker init. The reason: sglang quantizes the **target** model's transformer blocks but loads the **MTP draft worker's vocab embedding** at the source dtype (typically bf16). For a 7 B-class model with a 150k-token vocab × 4096 hidden, that's another ~1.2 GiB allocated *after* the static pool is reserved. At 0.85 fraction on a 16 GB card there's no room left.
Workaround: drop `mem_fraction_static` to ~0.7 so the post-static heap can absorb the MTP embedding alloc + CUDA graph private pools. Verified end-to-end on MiMo-7B-RL + fp8 + MTP on a 16 GB RTX 5070 Ti (`gallery/sglang-mimo-7b-mtp.yaml`) at ~88 tok/s. Models with larger vocabs or more MTP layers (e.g. DeepSeek-V3-MTP) need an even smaller fraction.
This isn't documented anywhere upstream as of 2026-05-06 — the SGLang Gemma 4 cookbook uses 0.85 because their MTP path doesn't go through `eagle_worker_v2.py` for an embedding-bearing draft module. Don't blanket-apply 0.7 across all sglang YAMLs; only when MTP-with-built-in-heads + quantization combine.
## Tool-call and reasoning parsers stay on `Options[]`
ServerArgs has `tool_call_parser` and `reasoning_parser` fields, and the backend does pass them through to `Engine` so SGLang's own HTTP/OAI surface keeps working. But for the **LocalAI** request path the backend constructs fresh per-request parser instances in `_make_parsers` (`backend.py:286`) because the parsers are stateful — the streaming and non-streaming paths each need their own.
So the user-facing knob stays on `Options[]`:
```yaml
options:
- tool_parser:hermes
- reasoning_parser:deepseek_r1
```
Putting these in `engine_args:` will set them on `ServerArgs` but the LocalAI-level streaming `ChatDelta` will not pick them up. Don't recommend that path.
## What's missing today (out of scope, but worth tracking)
- `core/config/hooks_sglang.go` — there is no SGLang equivalent of `hooks_vllm.go`. The vLLM hook auto-selects parsers for known model families from `parser_defaults.json` and seeds production engine_args defaults. A symmetric hook for SGLang could reuse the same `parser_defaults.json` (the SGLang parser names are different but the family detection is shared) and seed defaults like `enable_metrics: true` or attention-backend choices.
- `core/gallery/importers/sglang.go` — vLLM has an importer that resolves model architecture → parser defaults at gallery-import time. A matching importer for SGLang would let `local-ai install` populate sensible parsers automatically.
These should be a follow-up PR, not a blocker for the engine_args feature.

39
.docker/apt-mirror.sh Executable file
View File

@@ -0,0 +1,39 @@
#!/bin/sh
# Reconfigure Ubuntu apt sources to point at an alternate mirror.
#
# Used by Dockerfiles via `RUN --mount=type=bind,source=.docker/apt-mirror.sh,...`
# and by CI workflows on the runner to mitigate outages of the default
# archive.ubuntu.com / security.ubuntu.com / ports.ubuntu.com pool.
#
# Inputs (env):
# APT_MIRROR Replacement for archive.ubuntu.com and security.ubuntu.com
# (e.g. "http://azure.archive.ubuntu.com" or
# "https://mirrors.edge.kernel.org").
# Leave empty to keep upstream. The trailing "/ubuntu/..."
# path is preserved by the rewrite.
# APT_PORTS_MIRROR Replacement for ports.ubuntu.com (arm64/ppc64el/...).
# Leave empty to keep upstream.
#
# Both default to empty, in which case the script is a no-op.
set -e
if [ -z "${APT_MIRROR}" ] && [ -z "${APT_PORTS_MIRROR}" ]; then
exit 0
fi
# Ubuntu 24.04 (noble) ships DEB822 sources at /etc/apt/sources.list.d/ubuntu.sources;
# older releases use /etc/apt/sources.list. We rewrite whichever exists.
for f in /etc/apt/sources.list.d/ubuntu.sources /etc/apt/sources.list; do
[ -f "$f" ] || continue
if [ -n "${APT_MIRROR}" ]; then
# Use a comma delimiter so the alternation pipe in the regex
# is not interpreted as the s/// separator.
sed -i -E "s,https?://(archive\.ubuntu\.com|security\.ubuntu\.com),${APT_MIRROR},g" "$f"
fi
if [ -n "${APT_PORTS_MIRROR}" ]; then
sed -i -E "s,https?://ports\.ubuntu\.com,${APT_PORTS_MIRROR},g" "$f"
fi
done
echo "apt-mirror: rewrote sources (APT_MIRROR='${APT_MIRROR}', APT_PORTS_MIRROR='${APT_PORTS_MIRROR}')"

30
.docker/ik-llama-cpp-compile.sh Executable file
View File

@@ -0,0 +1,30 @@
#!/usr/bin/env bash
# Shared compile logic for backend/Dockerfile.ik-llama-cpp.
# Sourced (via bind mount) from both builder-fromsource and builder-prebuilt stages.
set -euxo pipefail
export CCACHE_DIR=/root/.ccache
ccache --max-size=5G || true
ccache -z || true
export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache"
if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
export CMAKE_ARGS="${CMAKE_ARGS} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
rm -rf /LocalAI/backend/cpp/ik-llama-cpp-*-build
fi
cd /LocalAI/backend/cpp/ik-llama-cpp
if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
# ARM64 / ROCm: build without x86 SIMD
make ik-llama-cpp-fallback
else
# ik_llama.cpp's IQK kernels require at least AVX2
make ik-llama-cpp-avx2
fi
ccache -s || true

244
.docker/install-base-deps.sh Executable file
View File

@@ -0,0 +1,244 @@
#!/usr/bin/env bash
# Single source of truth for builder-base contents.
#
# Used by:
# - backend/Dockerfile.base-grpc-builder (CI prebuilt-base source of truth)
# - backend/Dockerfile.llama-cpp (builder-fromsource stage)
# - backend/Dockerfile.ik-llama-cpp (builder-fromsource stage)
# - backend/Dockerfile.turboquant (builder-fromsource stage)
#
# All four files invoke this script via
# RUN --mount=type=bind,source=.docker/install-base-deps.sh,target=/usr/local/sbin/install-base-deps \
# --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
# bash /usr/local/sbin/install-base-deps
#
# so the prebuilt CI base image and the from-source local-dev path are
# bit-equivalent by construction.
#
# Inputs (env, populated from Dockerfile ARG/ENV):
# BUILD_TYPE ("cublas"|"l4t"|"hipblas"|"vulkan"|"sycl"|"clblas"|"")
# CUDA_MAJOR_VERSION ("12" | "13" | "")
# CUDA_MINOR_VERSION ("8" | "0" | "")
# TARGETARCH ("amd64" | "arm64")
# UBUNTU_VERSION ("2204" | "2404")
# SKIP_DRIVERS ("false" | "true")
# CMAKE_FROM_SOURCE ("false" | "true")
# CMAKE_VERSION ("3.31.10")
# GRPC_VERSION ("v1.65.0")
# GRPC_MAKEFLAGS ("-j4 -Otarget")
# APT_MIRROR / APT_PORTS_MIRROR (optional; consumed by /usr/local/sbin/apt-mirror)
# AMDGPU_TARGETS (optional; only relevant for hipblas downstream)
#
# IMPORTANT: install logic is copied verbatim from the prior in-Dockerfile
# RUN blocks. Do not paraphrase apt invocations / version pins / sed line
# numbers / deb URLs — the bit-equivalence guarantee depends on it.
set -eux
# --- 0. apt mirror rewrite (no-op when APT_MIRROR / APT_PORTS_MIRROR unset) ---
if [ -x /usr/local/sbin/apt-mirror ]; then
APT_MIRROR="${APT_MIRROR:-}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR:-}" \
sh /usr/local/sbin/apt-mirror
fi
export DEBIAN_FRONTEND=noninteractive
export MAKEFLAGS="${GRPC_MAKEFLAGS:-}"
# --- 1. Base apt build deps ---
apt-get update
apt-get install -y --no-install-recommends \
build-essential \
ccache git \
ca-certificates \
make \
pkg-config libcurl4-openssl-dev \
curl unzip \
libssl-dev wget
apt-get clean
rm -rf /var/lib/apt/lists/*
# --- 2. Vulkan SDK (BUILD_TYPE=vulkan) ---
# NB: this block intentionally installs `cmake` via apt as part of the
# Vulkan tooling — must run before the dedicated CMake step below.
if [ "${BUILD_TYPE:-}" = "vulkan" ] && [ "${SKIP_DRIVERS:-false}" = "false" ]; then
apt-get update
apt-get install -y --no-install-recommends \
software-properties-common pciutils wget gpg-agent
apt-get install -y libglm-dev cmake libxcb-dri3-0 libxcb-present0 libpciaccess0 \
libpng-dev libxcb-keysyms1-dev libxcb-dri3-dev libx11-dev g++ gcc \
libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
if [ "amd64" = "${TARGETARCH:-}" ]; then
wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz"
tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz
rm vulkansdk-linux-x86_64-1.4.335.0.tar.xz
mkdir -p /opt/vulkan-sdk
mv 1.4.335.0 /opt/vulkan-sdk/
( cd /opt/vulkan-sdk/1.4.335.0 && \
./vulkansdk --no-deps --maxjobs \
vulkan-loader \
vulkan-validationlayers \
vulkan-extensionlayer \
vulkan-tools \
shaderc )
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/bin/* /usr/bin/
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/lib/* /usr/lib/x86_64-linux-gnu/
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/include/* /usr/include/
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/share/* /usr/share/
rm -rf /opt/vulkan-sdk
fi
if [ "arm64" = "${TARGETARCH:-}" ]; then
mkdir vulkan
( cd vulkan && \
curl -L -o vulkan-sdk.tar.xz https://github.com/mudler/vulkan-sdk-arm/releases/download/1.4.335.0/vulkansdk-ubuntu-24.04-arm-1.4.335.0.tar.xz && \
tar -xvf vulkan-sdk.tar.xz && \
rm vulkan-sdk.tar.xz && \
cd 1.4.335.0 && \
cp -rfv aarch64/bin/* /usr/bin/ && \
cp -rfv aarch64/lib/* /usr/lib/aarch64-linux-gnu/ && \
cp -rfv aarch64/include/* /usr/include/ && \
cp -rfv aarch64/share/* /usr/share/ )
rm -rf vulkan
fi
ldconfig
apt-get clean
rm -rf /var/lib/apt/lists/*
fi
# --- 3. CUDA toolkit (BUILD_TYPE=cublas|l4t) ---
if { [ "${BUILD_TYPE:-}" = "cublas" ] || [ "${BUILD_TYPE:-}" = "l4t" ]; } && [ "${SKIP_DRIVERS:-false}" = "false" ]; then
apt-get update
apt-get install -y --no-install-recommends \
software-properties-common pciutils
if [ "amd64" = "${TARGETARCH:-}" ]; then
curl -O "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/x86_64/cuda-keyring_1.1-1_all.deb"
fi
if [ "arm64" = "${TARGETARCH:-}" ]; then
if [ "${CUDA_MAJOR_VERSION}" = "13" ]; then
curl -O "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/sbsa/cuda-keyring_1.1-1_all.deb"
else
curl -O "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/arm64/cuda-keyring_1.1-1_all.deb"
fi
fi
dpkg -i cuda-keyring_1.1-1_all.deb
rm -f cuda-keyring_1.1-1_all.deb
apt-get update
apt-get install -y --no-install-recommends \
"cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
"libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
"libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
"libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
"libcusparse-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
"libcusolver-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}"
if [ "${CUDA_MAJOR_VERSION}" = "13" ] && [ "arm64" = "${TARGETARCH:-}" ]; then
apt-get install -y --no-install-recommends \
"libcufile-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
"libcudnn9-cuda-${CUDA_MAJOR_VERSION}" \
"cuda-cupti-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}" \
"libnvjitlink-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}"
fi
apt-get clean
rm -rf /var/lib/apt/lists/*
fi
# --- 4. cuDSS / NVPL on arm64 + cublas (legacy JetPack / Tegra) ---
# https://github.com/NVIDIA/Isaac-GR00T/issues/343
if [ "${BUILD_TYPE:-}" = "cublas" ] && [ "${TARGETARCH:-}" = "arm64" ]; then
wget "https://developer.download.nvidia.com/compute/cudss/0.6.0/local_installers/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb"
dpkg -i "cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb"
cp /var/cudss-local-tegra-repo-ubuntu"${UBUNTU_VERSION}"-0.6.0/cudss-*-keyring.gpg /usr/share/keyrings/
apt-get update
apt-get -y install cudss "cudss-cuda-${CUDA_MAJOR_VERSION}"
wget "https://developer.download.nvidia.com/compute/nvpl/25.5/local_installers/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb"
dpkg -i "nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb"
cp /var/nvpl-local-repo-ubuntu"${UBUNTU_VERSION}"-25.5/nvpl-*-keyring.gpg /usr/share/keyrings/
apt-get update
apt-get install -y nvpl
fi
# --- 5. clBLAS (BUILD_TYPE=clblas) ---
# Present in variant Dockerfiles' from-source path but not in master's
# Dockerfile.base-grpc-builder. No CI matrix entry currently uses this,
# but keep parity so a future BUILD_TYPE=clblas build doesn't drift.
if [ "${BUILD_TYPE:-}" = "clblas" ] && [ "${SKIP_DRIVERS:-false}" = "false" ]; then
apt-get update
apt-get install -y --no-install-recommends \
libclblast-dev
apt-get clean
rm -rf /var/lib/apt/lists/*
fi
# --- 6. ROCm / HIP build deps (BUILD_TYPE=hipblas) ---
if [ "${BUILD_TYPE:-}" = "hipblas" ] && [ "${SKIP_DRIVERS:-false}" = "false" ]; then
apt-get update
apt-get install -y --no-install-recommends \
hipblas-dev \
hipblaslt-dev \
rocblas-dev
apt-get clean
rm -rf /var/lib/apt/lists/*
# I have no idea why, but the ROCM lib packages don't trigger ldconfig after they install,
# which results in local-ai and others not being able to locate the libraries.
# We run ldconfig ourselves to work around this packaging deficiency.
ldconfig
# Log which GPU architectures have rocBLAS kernel support
echo "rocBLAS library data architectures:"
(ls /opt/rocm*/lib/rocblas/library/Kernels* 2>/dev/null || ls /opt/rocm*/lib64/rocblas/library/Kernels* 2>/dev/null) | grep -oP 'gfx[0-9a-z+-]+' | sort -u || \
echo "WARNING: No rocBLAS kernel data found"
fi
echo "TARGETARCH: ${TARGETARCH:-}"
# --- 7. protoc (always) ---
# The version in 22.04 is too old. We will create one as part of installing
# the GRPC build below but that will also bring in a newer version of absl
# which stablediffusion cannot compile with. This version of protoc is only
# here so that we can generate the grpc code for the stablediffusion build.
if [ "amd64" = "${TARGETARCH:-}" ]; then
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-x86_64.zip -o protoc.zip
unzip -j -d /usr/local/bin protoc.zip bin/protoc
rm protoc.zip
fi
if [ "arm64" = "${TARGETARCH:-}" ]; then
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-aarch_64.zip -o protoc.zip
unzip -j -d /usr/local/bin protoc.zip bin/protoc
rm protoc.zip
fi
# --- 8. CMake (apt or compiled from source) ---
# The version in 22.04 is too old. Vulkan path above already pulled cmake
# via apt; the from-source branch here will install over it which is fine.
if [ "${CMAKE_FROM_SOURCE:-false}" = "true" ]; then
curl -L -s "https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz" -o cmake.tar.gz
tar xvf cmake.tar.gz
( cd "cmake-${CMAKE_VERSION}" && ./configure && make && make install )
else
apt-get update
apt-get install -y \
cmake
apt-get clean
rm -rf /var/lib/apt/lists/*
fi
# --- 9. gRPC compile + install at /opt/grpc ---
# We install GRPC to a different prefix here so that we can copy in only
# the build artifacts later — saves several hundred MB on the final docker
# image size vs copying in the entire GRPC source tree and running
# `make install` in the target container.
#
# The TESTONLY abseil sed patch and /opt/grpc prefix are load-bearing —
# downstream Dockerfiles `COPY` /opt/grpc to /usr/local (or rely on the
# prebuilt base having it at /opt/grpc).
mkdir -p /build
cd /build
git clone --recurse-submodules --jobs 4 -b "${GRPC_VERSION}" --depth 1 --shallow-submodules https://github.com/grpc/grpc
mkdir -p /build/grpc/cmake/build
cd /build/grpc/cmake/build
sed -i "216i\\ TESTONLY" "../../third_party/abseil-cpp/absl/container/CMakeLists.txt"
cmake -DgRPC_INSTALL=ON -DgRPC_BUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX:PATH=/opt/grpc ../..
make
make install
cd /
rm -rf /build

35
.docker/llama-cpp-compile.sh Executable file
View File

@@ -0,0 +1,35 @@
#!/usr/bin/env bash
# Shared compile logic for backend/Dockerfile.llama-cpp.
# Sourced (via bind mount) from both builder-fromsource and builder-prebuilt stages.
set -euxo pipefail
export CCACHE_DIR=/root/.ccache
ccache --max-size=5G || true
ccache -z || true
export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache"
if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
export CMAKE_ARGS="${CMAKE_ARGS} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
rm -rf /LocalAI/backend/cpp/llama-cpp-*-build
fi
if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
cd /LocalAI/backend/cpp/llama-cpp
make llama-cpp-fallback
make llama-cpp-grpc
make llama-cpp-rpc-server
else
cd /LocalAI/backend/cpp/llama-cpp
make llama-cpp-avx
make llama-cpp-avx2
make llama-cpp-avx512
make llama-cpp-fallback
make llama-cpp-grpc
make llama-cpp-rpc-server
fi
ccache -s || true

35
.docker/turboquant-compile.sh Executable file
View File

@@ -0,0 +1,35 @@
#!/usr/bin/env bash
# Shared compile logic for backend/Dockerfile.turboquant.
# Sourced (via bind mount) from both builder-fromsource and builder-prebuilt stages.
set -euxo pipefail
export CCACHE_DIR=/root/.ccache
ccache --max-size=5G || true
ccache -z || true
export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache"
if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
export CMAKE_ARGS="${CMAKE_ARGS} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
rm -rf /LocalAI/backend/cpp/turboquant-*-build
fi
cd /LocalAI/backend/cpp/turboquant
if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
make turboquant-fallback
make turboquant-grpc
make turboquant-rpc-server
else
make turboquant-avx
make turboquant-avx2
make turboquant-avx512
make turboquant-fallback
make turboquant-grpc
make turboquant-rpc-server
fi
ccache -s || true

View File

@@ -0,0 +1,100 @@
name: 'Configure apt mirror'
description: |
Reconfigure the GitHub Actions runner's Ubuntu apt sources to use an
alternate mirror, and emit the effective URLs as outputs so callers can
forward them as Docker build-args.
Two mirror profiles depending on where the runner lives, because the
best mirror differs by network:
* github-hosted runners run on Azure, so they default to the
Azure-hosted Ubuntu mirror (lowest latency, same VPC).
* self-hosted runners (arc-runner-set, bigger-runner, ...) typically
cannot route to azure.archive.ubuntu.com, so they default to the
kernel.org mirror, which is publicly reachable from anywhere.
Pass an empty string to either input to skip the rewrite for that
profile and keep upstream archive.ubuntu.com / ports.ubuntu.com.
inputs:
github-hosted-mirror:
description: 'archive/security mirror URL for github-hosted runners (empty = upstream)'
required: false
default: 'http://azure.archive.ubuntu.com'
github-hosted-ports-mirror:
description: 'ports.ubuntu.com mirror URL for github-hosted runners (empty = upstream)'
required: false
default: 'http://azure.ports.ubuntu.com'
self-hosted-mirror:
description: 'archive/security mirror URL for self-hosted runners (empty = upstream)'
required: false
# HTTP, not HTTPS: the bare ubuntu:24.04 builder image doesn't ship
# ca-certificates, so the very first apt-get update over TLS would
# fail with "No system certificates available" before it can install
# anything. apt validates package integrity via GPG signatures, so
# plain HTTP is safe for the archive itself.
default: 'http://mirrors.edge.kernel.org'
self-hosted-ports-mirror:
description: 'ports.ubuntu.com mirror URL for self-hosted runners (empty = upstream)'
required: false
# mirrors.edge.kernel.org does NOT carry /ubuntu-ports/ — only the
# main /ubuntu/ archive — so arm64 builds 404 there. Leave ports
# upstream by default. The original DDoS was on archive.ubuntu.com
# so ports.ubuntu.com remains the path of least surprise.
default: ''
outputs:
effective-mirror:
description: 'The mirror URL actually applied for this runner (or empty)'
value: ${{ steps.pick.outputs.mirror }}
effective-ports-mirror:
description: 'The ports mirror URL actually applied for this runner (or empty)'
value: ${{ steps.pick.outputs.ports-mirror }}
runs:
using: 'composite'
steps:
- name: Pick effective mirror for this runner
id: pick
shell: bash
env:
RUNNER_ENV: ${{ runner.environment }}
GH_MIRROR: ${{ inputs.github-hosted-mirror }}
GH_PORTS_MIRROR: ${{ inputs.github-hosted-ports-mirror }}
SH_MIRROR: ${{ inputs.self-hosted-mirror }}
SH_PORTS_MIRROR: ${{ inputs.self-hosted-ports-mirror }}
run: |
if [ "${RUNNER_ENV}" = "github-hosted" ]; then
MIRROR="${GH_MIRROR}"
PORTS_MIRROR="${GH_PORTS_MIRROR}"
else
MIRROR="${SH_MIRROR}"
PORTS_MIRROR="${SH_PORTS_MIRROR}"
fi
echo "configure-apt-mirror: runner=${RUNNER_ENV} mirror='${MIRROR}' ports-mirror='${PORTS_MIRROR}'"
echo "mirror=${MIRROR}" >> "$GITHUB_OUTPUT"
echo "ports-mirror=${PORTS_MIRROR}" >> "$GITHUB_OUTPUT"
- name: Rewrite apt sources
if: steps.pick.outputs.mirror != '' || steps.pick.outputs.ports-mirror != ''
shell: bash
env:
APT_MIRROR: ${{ steps.pick.outputs.mirror }}
APT_PORTS_MIRROR: ${{ steps.pick.outputs.ports-mirror }}
run: |
set -e
# Ubuntu 24.04 (noble) ships DEB822 sources at
# /etc/apt/sources.list.d/ubuntu.sources; older releases use
# /etc/apt/sources.list. Rewrite whichever exists.
for f in /etc/apt/sources.list.d/ubuntu.sources /etc/apt/sources.list; do
sudo test -f "$f" || continue
if [ -n "${APT_MIRROR}" ]; then
# Comma delimiter so the alternation pipe in the regex is not
# interpreted as the s/// separator.
sudo sed -i -E "s,https?://(archive\.ubuntu\.com|security\.ubuntu\.com),${APT_MIRROR},g" "$f"
fi
if [ -n "${APT_PORTS_MIRROR}" ]; then
sudo sed -i -E "s,https?://ports\.ubuntu\.com,${APT_PORTS_MIRROR},g" "$f"
fi
done
echo "Runner apt mirror configured (APT_MIRROR='${APT_MIRROR}', APT_PORTS_MIRROR='${APT_PORTS_MIRROR}')"

View File

@@ -0,0 +1,65 @@
name: 'Free disk space on hosted runners'
description: |
Aggressively clean GitHub-hosted ubuntu-latest runners to reclaim ~6-10 GB
of working space before docker buildx steps. Combines jlumbroso/free-disk-space
with explicit apt purges of large packages we never use (dotnet, ghc, mono,
android, jdk, ...).
No-op on self-hosted runners; pass mode=skip to force-disable.
inputs:
mode:
description: 'hosted (default — clean) or skip (no-op)'
required: false
default: 'hosted'
runs:
using: 'composite'
steps:
- name: Free Disk Space (Ubuntu)
if: inputs.mode == 'hosted' && runner.environment == 'github-hosted'
uses: jlumbroso/free-disk-space@main
with:
tool-cache: true
android: true
dotnet: true
haskell: true
large-packages: true
docker-images: true
swap-storage: true
- name: Release space from worker
if: inputs.mode == 'hosted' && runner.environment == 'github-hosted'
shell: bash
run: |
echo "Listing top largest packages"
pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
head -n 30 <<< "${pkgs}"
df -h
sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
sudo apt-get remove --auto-remove android-sdk-platform-tools snapd || true
sudo apt-get purge --auto-remove android-sdk-platform-tools snapd || true
sudo rm -rf /usr/local/lib/android
sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
sudo rm -rf /usr/share/dotnet
sudo apt-get remove -y '^mono-.*' || true
sudo apt-get remove -y '^ghc-.*' || true
sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
sudo apt-get remove -y 'php.*' || true
sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
sudo apt-get remove -y '^google-.*' || true
sudo apt-get remove -y azure-cli || true
sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
sudo apt-get remove -y '^gfortran-.*' || true
sudo apt-get remove -y microsoft-edge-stable || true
sudo apt-get remove -y firefox || true
sudo apt-get remove -y powershell || true
sudo apt-get remove -y r-base-core || true
sudo apt-get autoremove -y
sudo apt-get clean
sudo rm -rfv build || true
sudo rm -rf /usr/share/dotnet || true
sudo rm -rf /opt/ghc || true
sudo rm -rf "/usr/local/share/boost" || true
sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
df -h

View File

@@ -0,0 +1,59 @@
name: 'Set up build disk on hosted runners'
description: |
Relocate Docker's data-root to /mnt (which has ~75 GB free, vs ~20 GB
on / after free-disk-space). Combined with the apt cleanup, gives
~100 GB working space for buildx — enough for ROCm dev image + vLLM
torch install + flash-attn build.
No-op on:
- self-hosted runners (no /mnt expectation)
- non-X64 runners (verify /mnt shape on ubuntu-24.04-arm separately
before enabling there — see Task 3.2 in the migration plan)
- mode=skip (force-disable from caller)
Must run after free-disk-space (which removes large packages — would
fail mid-uninstall if Docker were stopped) and before any Docker
operation (setup-qemu, setup-buildx, login, build) so the relocated
data-root catches all subsequent docker activity.
inputs:
mode:
description: 'auto (default — relocate on hosted X64 only) or skip'
required: false
default: 'auto'
runs:
using: 'composite'
steps:
- name: Relocate Docker data-root to /mnt
if: inputs.mode == 'auto' && runner.environment == 'github-hosted' && runner.arch == 'X64'
shell: bash
run: |
set -euo pipefail
echo "Before relocation:"
df -h / /mnt || true
sudo systemctl stop docker docker.socket
sudo mkdir -p /mnt/docker-data /mnt/docker-tmp
# buildx CLI runs as the unprivileged runner user and creates
# config dirs under TMPDIR before binding them into the buildkit
# container. /mnt is owned by root by default; mirror /tmp's
# 1777 (world-writable + sticky) so non-root processes can write.
sudo chmod 1777 /mnt/docker-tmp
if [ -d /var/lib/docker ] && [ ! -L /var/lib/docker ]; then
sudo rsync -a /var/lib/docker/ /mnt/docker-data/
sudo rm -rf /var/lib/docker
sudo ln -s /mnt/docker-data /var/lib/docker
fi
# daemon.json may not exist; merge data-root in or create minimal.
if [ -f /etc/docker/daemon.json ]; then
sudo jq '."data-root" = "/mnt/docker-data"' /etc/docker/daemon.json | sudo tee /etc/docker/daemon.json.new >/dev/null
sudo mv /etc/docker/daemon.json.new /etc/docker/daemon.json
else
echo '{"data-root":"/mnt/docker-data"}' | sudo tee /etc/docker/daemon.json
fi
sudo systemctl start docker
# Make TMPDIR persist for subsequent steps in the same job.
echo "TMPDIR=/mnt/docker-tmp" >> "$GITHUB_ENV"
echo "After relocation:"
df -h / /mnt
docker info | grep -i 'docker root dir' || true

3941
.github/backend-matrix.yml vendored Normal file
View File

File diff suppressed because it is too large Load Diff

45
.github/bump_vllm_wheel.sh vendored Executable file
View File

@@ -0,0 +1,45 @@
#!/bin/bash
# Bump the cublas13 vLLM wheel pin in requirements-cublas13-after.txt.
#
# vLLM's PyPI wheel is built against CUDA 12 so the cublas13 build pulls a
# cu130-flavoured wheel from vLLM's per-tag index at
# https://wheels.vllm.ai/<TAG>/cu130/. That URL segment is itself version-locked
# (no /latest/ alias upstream), so bumping vLLM means rewriting both the URL
# segment and the version constraint atomically. bump_deps.sh handles git-sha
# vars in Makefiles; this script handles the two-value rewrite specific to the
# vLLM requirements file.
set -xe
REPO=$1 # vllm-project/vllm
FILE=$2 # backend/python/vllm/requirements-cublas13-after.txt
VAR=$3 # VLLM_VERSION (used for output file names so the workflow can read them)
if [ -z "$FILE" ] || [ -z "$REPO" ] || [ -z "$VAR" ]; then
echo "usage: $0 <repo> <requirements-file> <var-name>" >&2
exit 1
fi
# /releases/latest returns the most recent non-prerelease tag.
LATEST_TAG=$(curl -sS -H "Accept: application/vnd.github+json" \
"https://api.github.com/repos/$REPO/releases/latest" \
| python3 -c "import json,sys; print(json.load(sys.stdin)['tag_name'])")
# Strip leading 'v' (vLLM tags are 'v0.20.0', the URL/version use '0.20.0').
NEW_VERSION="${LATEST_TAG#v}"
set +e
CURRENT_VERSION=$(grep -oE '^vllm==[0-9]+\.[0-9]+\.[0-9]+' "$FILE" | head -1 | cut -d= -f3)
set -e
# sed both lines unconditionally — peter-evans/create-pull-request opens no PR
# when the working tree is clean, so a no-op rewrite is safe.
sed -i "$FILE" \
-e "s|wheels\.vllm\.ai/[^/]*/cu130|wheels.vllm.ai/$NEW_VERSION/cu130|g" \
-e "s|^vllm==.*|vllm==$NEW_VERSION|"
if [ -z "$CURRENT_VERSION" ]; then
echo "Could not find vllm==X.Y.Z in $FILE."
exit 0
fi
echo "Changes: https://github.com/$REPO/compare/v${CURRENT_VERSION}...${LATEST_TAG}" >> "${VAR}_message.txt"
echo "${NEW_VERSION}" >> "${VAR}_commit.txt"

46
.github/scripts/anchor-digest-in-cache.sh vendored Executable file
View File

@@ -0,0 +1,46 @@
#!/usr/bin/env bash
# Anchor a backend per-arch digest in quay.io/go-skynet/ci-cache so quay's
# garbage collector won't reap the manifest before backend_merge.yml runs.
#
# Context: backend_build.yml pushes by canonical digest only
# (push-by-digest=true). Unreferenced manifests on quay can be reaped within
# ~1-2h, but backend-merge-jobs runs only after the *entire* per-arch build
# matrix drains (max-parallel: 8 × dozens of entries → ~2h+). Without an
# anchoring tag, the earliest digests are gone by the time `imagetools create`
# tries to read them, producing "manifest not found" merge failures.
#
# We tag the digest under our internal ci-cache image; quay does not GC tagged
# manifests. The user-facing manifest list still references the original
# digest in local-ai-backends. backend_merge.yml deletes the anchor tag after
# the user-facing manifest is published — see cleanup-keepalive-tags.sh.
#
# Required env:
# GITHUB_RUN_ID - current workflow run id (set automatically by GHA)
# TAG_SUFFIX - matrix entry's tag-suffix (e.g. -gpu-nvidia-cuda-12-vllm)
# PLATFORM_TAG - amd64 / arm64 / single (single = singleton matrix entry)
# DIGEST - canonical content digest from build step (sha256:...)
#
# Optional env:
# ANCHOR_IMAGE - target image (default: quay.io/go-skynet/ci-cache)
# SOURCE_IMAGE - source image (default: quay.io/go-skynet/local-ai-backends)
# GITHUB_STEP_SUMMARY - if set, an anchored-by line is appended to it
set -euo pipefail
: "${GITHUB_RUN_ID:?}"
: "${TAG_SUFFIX:?}"
: "${PLATFORM_TAG:?}"
: "${DIGEST:?}"
anchor_image="${ANCHOR_IMAGE:-quay.io/go-skynet/ci-cache}"
source_image="${SOURCE_IMAGE:-quay.io/go-skynet/local-ai-backends}"
tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${PLATFORM_TAG}"
docker buildx imagetools create \
-t "${anchor_image}:${tag}" \
"${source_image}@${DIGEST}"
echo "anchored ${DIGEST} as ${anchor_image}:${tag}"
if [[ -n "${GITHUB_STEP_SUMMARY:-}" ]]; then
echo "anchored \`${DIGEST}\` as \`${anchor_image}:${tag}\`" >> "${GITHUB_STEP_SUMMARY}"
fi

49
.github/scripts/cleanup-keepalive-tags.sh vendored Executable file
View File

@@ -0,0 +1,49 @@
#!/usr/bin/env bash
# Best-effort cleanup of the keepalive anchor tags written by
# anchor-digest-in-cache.sh. Called from backend_merge.yml after the
# user-facing manifest list has been published.
#
# Quay's docker registry v2 doesn't allow tag deletes — only digest deletes.
# The proper delete is the quay REST API, which requires an OAuth-scoped
# token. We try QUAY_TOKEN as a bearer token: if the secret is an OAuth app
# token (typical for service accounts) the delete succeeds; otherwise this
# is a soft no-op and the tag persists until manually pruned.
#
# Cleanup failure MUST NOT fail the merge — the merge has already produced
# the user-facing manifest list at this point and the keepalive tags are
# pure overhead. We always exit 0.
#
# Required env:
# GITHUB_RUN_ID - current workflow run id (set automatically by GHA)
# TAG_SUFFIX - matrix entry's tag-suffix (e.g. -gpu-nvidia-cuda-12-vllm)
# QUAY_TOKEN - bearer token for quay's REST API
#
# Optional env:
# QUAY_REPO - target repo (default: go-skynet/ci-cache)
# PLATFORM_TAGS - space-separated list of platform-tag values to try
# (default: "amd64 arm64 single")
# We don't know which platform-tag(s) exist for this
# tag-suffix without an extra API call, so we just try
# all three and ignore 404s for the ones that don't.
set -uo pipefail
: "${GITHUB_RUN_ID:?}"
: "${TAG_SUFFIX:?}"
: "${QUAY_TOKEN:?}"
quay_repo="${QUAY_REPO:-go-skynet/ci-cache}"
platform_tags="${PLATFORM_TAGS:-amd64 arm64 single}"
for plat in $platform_tags; do
tag="keepalive-${GITHUB_RUN_ID}${TAG_SUFFIX}-${plat}"
url="https://quay.io/api/v1/repository/${quay_repo}/tag/${tag}"
http=$(curl -sS -o /dev/null -w '%{http_code}' \
-X DELETE -H "Authorization: Bearer ${QUAY_TOKEN}" "$url" || echo "000")
case "$http" in
204|200) echo "deleted $tag" ;;
404) echo "not present: $tag" ;;
401|403) echo "auth not OAuth-scoped (http $http) for $tag - skipping; orphan tag will persist" ;;
*) echo "unexpected http $http deleting $tag - skipping" ;;
esac
done
exit 0

View File

File diff suppressed because it is too large Load Diff

View File

@@ -24,6 +24,17 @@ on:
description: 'Platforms'
default: ''
type: string
platform-tag:
description: |
Short tag identifying the platform leg, e.g. "amd64" or "arm64".
Used to scope the per-arch registry cache and the digest artifact name.
Required for split-and-merge multi-arch builds; pass "amd64" for
single-arch amd64 builds too. Optional (default '') during the
migration to per-arch matrix expansion; will be flipped to
required: true in Phase 6 once all callers pass an explicit value.
required: false
default: ''
type: string
tag-latest:
description: 'Tag latest'
default: ''
@@ -61,7 +72,16 @@ on:
amdgpu-targets:
description: 'AMD GPU targets for ROCm/HIP builds'
required: false
default: 'gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201'
default: ''
type: string
builder-base-image:
description: |
Pre-built builder base image (e.g. quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64).
When set, the variant Dockerfile uses its `builder-prebuilt` stage which FROMs this
image directly instead of running its own gRPC stage + apt installs. Empty for
backends whose Dockerfile doesn't support a prebuilt base.
required: false
default: ''
type: string
secrets:
dockerUsername:
@@ -80,76 +100,22 @@ jobs:
quay_username: ${{ secrets.quayUsername }}
steps:
- name: Free Disk Space (Ubuntu)
if: inputs.runs-on == 'ubuntu-latest'
uses: jlumbroso/free-disk-space@main
with:
# this might remove tools that are actually needed,
# if set to "true" but frees about 6 GB
tool-cache: true
# all of these default to true, but feel free to set to
# "false" if necessary for your workflow
android: true
dotnet: true
haskell: true
large-packages: true
docker-images: true
swap-storage: true
- name: Force Install GIT latest
run: |
sudo apt-get update \
&& sudo apt-get install -y software-properties-common \
&& sudo apt-get update \
&& sudo add-apt-repository -y ppa:git-core/ppa \
&& sudo apt-get update \
&& sudo apt-get install -y git
- name: Checkout
uses: actions/checkout@v6
with:
submodules: true
- name: Release space from worker
if: inputs.runs-on == 'ubuntu-latest'
run: |
echo "Listing top largest packages"
pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
head -n 30 <<< "${pkgs}"
echo
df -h
echo
sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
sudo apt-get remove --auto-remove android-sdk-platform-tools snapd || true
sudo apt-get purge --auto-remove android-sdk-platform-tools snapd || true
sudo rm -rf /usr/local/lib/android
sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
sudo rm -rf /usr/share/dotnet
sudo apt-get remove -y '^mono-.*' || true
sudo apt-get remove -y '^ghc-.*' || true
sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
sudo apt-get remove -y 'php.*' || true
sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
sudo apt-get remove -y '^google-.*' || true
sudo apt-get remove -y azure-cli || true
sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
sudo apt-get remove -y '^gfortran-.*' || true
sudo apt-get remove -y microsoft-edge-stable || true
sudo apt-get remove -y firefox || true
sudo apt-get remove -y powershell || true
sudo apt-get remove -y r-base-core || true
sudo apt-get autoremove -y
sudo apt-get clean
echo
echo "Listing top largest packages"
pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
head -n 30 <<< "${pkgs}"
echo
sudo rm -rfv build || true
sudo rm -rf /usr/share/dotnet || true
sudo rm -rf /opt/ghc || true
sudo rm -rf "/usr/local/share/boost" || true
sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
df -h
- name: Configure apt mirror on runner
id: apt_mirror
uses: ./.github/actions/configure-apt-mirror
- name: Free disk space
uses: ./.github/actions/free-disk-space
with:
mode: ${{ inputs.runs-on == 'ubuntu-latest' && 'hosted' || 'skip' }}
- name: Set up build disk
uses: ./.github/actions/setup-build-disk
- name: Docker meta
id: meta
@@ -206,7 +172,17 @@ jobs:
username: ${{ secrets.quayUsername }}
password: ${{ secrets.quayPassword }}
- name: Build and push
# Weekly cache-buster for the per-backend `make` step. Most Python
# backends list unpinned deps (torch, transformers, vllm, ...), so a
# warm cache freezes upstream versions indefinitely. Rolling this
# weekly forces a re-resolve of the install layer at most once per
# week, picking up newer wheels without a full cold rebuild.
- name: Compute deps refresh key
id: deps_refresh
run: echo "key=$(date -u +%Y-W%V)" >> "$GITHUB_OUTPUT"
- name: Build and push by digest
id: build
uses: docker/build-push-action@v7
if: github.event_name != 'pull_request'
with:
@@ -220,15 +196,65 @@ jobs:
BACKEND=${{ inputs.backend }}
UBUNTU_VERSION=${{ inputs.ubuntu-version }}
AMDGPU_TARGETS=${{ inputs.amdgpu-targets }}
APT_MIRROR=${{ steps.apt_mirror.outputs.effective-mirror }}
APT_PORTS_MIRROR=${{ steps.apt_mirror.outputs.effective-ports-mirror }}
DEPS_REFRESH=${{ steps.deps_refresh.outputs.key }}
BUILDER_BASE_IMAGE=${{ inputs.builder-base-image }}
BUILDER_TARGET=${{ inputs.builder-base-image != '' && 'builder-prebuilt' || 'builder-fromsource' }}
context: ${{ inputs.context }}
file: ${{ inputs.dockerfile }}
cache-from: type=gha
cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }}-${{ inputs.platform-tag }}
cache-to: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }}-${{ inputs.platform-tag }},mode=max,ignore-error=true
platforms: ${{ inputs.platforms }}
push: ${{ github.event_name != 'pull_request' }}
tags: ${{ steps.meta.outputs.tags }}
outputs: |
type=image,name=quay.io/go-skynet/local-ai-backends,push-by-digest=true,name-canonical=true,push=true
type=image,name=localai/localai-backends,push-by-digest=true,name-canonical=true,push=true
# Disable provenance: with mode=max (the default for push:true)
# buildx bundles a per-registry attestation manifest into each
# registry's manifest list, which makes the resulting list digest
# diverge across registries. steps.build.outputs.digest then
# only matches one of them, and the merge job's
# `imagetools create <reg>@sha256:<digest>` lookup fails on the
# other. Disabling provenance keeps the digest content-only and
# identical across both registries — required for digest-based
# cross-registry merge.
provenance: false
labels: ${{ steps.meta.outputs.labels }}
- name: Build and push (PR)
- name: Export digest
if: github.event_name != 'pull_request'
run: |
mkdir -p /tmp/digests
digest="${{ steps.build.outputs.digest }}"
touch "/tmp/digests/${digest#sha256:}"
# See .github/scripts/anchor-digest-in-cache.sh for why this is needed
# and how it interacts with backend_merge.yml's cleanup step.
- name: Anchor digest in ci-cache so quay GC won't reap before merge
if: github.event_name != 'pull_request'
env:
TAG_SUFFIX: ${{ inputs.tag-suffix }}
PLATFORM_TAG: ${{ inputs.platform-tag || 'single' }}
DIGEST: ${{ steps.build.outputs.digest }}
run: .github/scripts/anchor-digest-in-cache.sh
# Artifact name uses a `--` separator between tag-suffix and platform-tag
# to avoid prefix collisions during the merge job's pattern-based download.
# Tag-suffixes are not prefix-disjoint (e.g. -gpu-nvidia-cuda-12-vllm is a
# prefix of -gpu-nvidia-cuda-12-vllm-omni); a single `-` separator plus the
# merge-side `digests<tag-suffix>-*` glob would let one merge over-match
# the other backend's artifacts. The `-single` placeholder for empty
# platform-tag (single-arch entries) keeps the artifact name non-trailing.
- name: Upload digest artifact
if: github.event_name != 'pull_request'
uses: actions/upload-artifact@v7
with:
name: digests${{ inputs.tag-suffix }}--${{ inputs.platform-tag || 'single' }}
path: /tmp/digests/*
if-no-files-found: error
retention-days: 1
- name: Build (PR)
uses: docker/build-push-action@v7
if: github.event_name == 'pull_request'
with:
@@ -242,9 +268,14 @@ jobs:
BACKEND=${{ inputs.backend }}
UBUNTU_VERSION=${{ inputs.ubuntu-version }}
AMDGPU_TARGETS=${{ inputs.amdgpu-targets }}
APT_MIRROR=${{ steps.apt_mirror.outputs.effective-mirror }}
APT_PORTS_MIRROR=${{ steps.apt_mirror.outputs.effective-ports-mirror }}
DEPS_REFRESH=${{ steps.deps_refresh.outputs.key }}
BUILDER_BASE_IMAGE=${{ inputs.builder-base-image }}
BUILDER_TARGET=${{ inputs.builder-base-image != '' && 'builder-prebuilt' || 'builder-fromsource' }}
context: ${{ inputs.context }}
file: ${{ inputs.dockerfile }}
cache-from: type=gha
cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache${{ inputs.tag-suffix }}-${{ inputs.platform-tag }}
platforms: ${{ inputs.platforms }}
push: ${{ env.quay_username != '' }}
tags: ${{ steps.meta_pull_request.outputs.tags }}

View File

@@ -48,6 +48,13 @@ jobs:
strategy:
matrix:
go-version: ['${{ inputs.go-version }}']
env:
# Keep the brew Cellar stable across cache restores. Without these,
# `brew install` would auto-update brew itself and re-link formulas,
# mutating the very paths the cache just restored.
HOMEBREW_NO_AUTO_UPDATE: '1'
HOMEBREW_NO_INSTALL_CLEANUP: '1'
HOMEBREW_NO_ANALYTICS: '1'
steps:
- name: Clone
uses: actions/checkout@v6
@@ -58,21 +65,190 @@ jobs:
uses: actions/setup-go@v5
with:
go-version: ${{ matrix.go-version }}
cache: false
# Caches ~/go/pkg/mod and ~/Library/Caches/go-build keyed on go.sum.
# Shared across every darwin matrix entry — first job in a run warms
# it, the rest hit warm.
cache: true
# You can test your matrix by printing the current Go version
- name: Display Go version
run: go version
# ---- Homebrew cache ----
# macOS runners have no Docker daemon, so the BuildKit registry cache used
# for Linux backend images (see .agents/ci-caching.md) doesn't apply here.
# We cache the brew downloads + Cellar entries for the formulas we install
# below. Read on every run, write only on master/tag pushes — same policy
# as the Linux registry cache.
- name: Restore Homebrew cache
id: brew-cache
uses: actions/cache/restore@v4
with:
path: |
~/Library/Caches/Homebrew/downloads
/opt/homebrew/Cellar/protobuf
/opt/homebrew/Cellar/grpc
/opt/homebrew/Cellar/protoc-gen-go
/opt/homebrew/Cellar/protoc-gen-go-grpc
/opt/homebrew/Cellar/libomp
/opt/homebrew/Cellar/llvm
/opt/homebrew/Cellar/ccache
/opt/homebrew/Cellar/blake3
/opt/homebrew/Cellar/fmt
/opt/homebrew/Cellar/hiredis
/opt/homebrew/Cellar/xxhash
/opt/homebrew/Cellar/zstd
key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}
- name: Dependencies
run: |
brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm
# ccache is always installed (used by the llama-cpp variant build) so
# the brew cache content stays stable across every backend in the
# matrix — they all share one cache key.
# blake3, fmt, hiredis, xxhash, zstd are ccache's runtime dylib deps.
# Without explicitly installing them, a brew cache-hit run restores
# ccache's Cellar dir but skips installing those transitive deps,
# and ccache fails at runtime with `dyld: Library not loaded`.
brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd
# Force-reinstall ccache so brew re-validates its full runtime-dep
# closure on every run. This is the durable fix: when the upstream
# ccache formula gains a new transitive dep (as it has multiple times
# already), we don't have to chase missing dylibs one at a time.
# The downloads cache makes the reinstall fast (~5s on a hit).
brew reinstall ccache
# Same pattern for grpc: its CMake config (used by the llama-cpp
# `grpc-server` target) does find_package(absl). The cache restores
# /opt/homebrew/Cellar/grpc so brew above no-ops the install, but
# abseil isn't in our Cellar cache list and never gets installed
# alongside, leaving grpc's CMake unable to resolve it. Reinstalling
# grpc re-validates and pulls abseil in, mirroring the ccache fix.
brew reinstall grpc
# The brew cache restores the Cellar dirs but NOT the bin symlinks
# at /opt/homebrew/bin/*. brew install above sees the Cellar present
# and decides "already installed" without re-linking, so on a cache-
# hit run the formulas aren't on PATH. Force-link them; --overwrite
# tolerates pre-existing symlinks from earlier installs.
brew link --overwrite protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd 2>/dev/null || true
- name: Save Homebrew cache
if: github.event_name != 'pull_request' && steps.brew-cache.outputs.cache-hit != 'true'
uses: actions/cache/save@v4
with:
path: |
~/Library/Caches/Homebrew/downloads
/opt/homebrew/Cellar/protobuf
/opt/homebrew/Cellar/grpc
/opt/homebrew/Cellar/protoc-gen-go
/opt/homebrew/Cellar/protoc-gen-go-grpc
/opt/homebrew/Cellar/libomp
/opt/homebrew/Cellar/llvm
/opt/homebrew/Cellar/ccache
/opt/homebrew/Cellar/blake3
/opt/homebrew/Cellar/fmt
/opt/homebrew/Cellar/hiredis
/opt/homebrew/Cellar/xxhash
/opt/homebrew/Cellar/zstd
key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}
# ---- ccache for llama.cpp CMake builds ----
# Three CMake variants (fallback, grpc, rpc-server) compile the same
# llama.cpp source tree with overlapping flags — ccache dedupes object
# files across them. Key on the pinned LLAMA_VERSION so a pin bump
# invalidates cleanly; restore-keys fall back to the latest entry for the
# same pin so unchanged TUs stay warm even when the cache is fresh.
- name: Compute llama.cpp version
if: inputs.backend == 'llama-cpp'
id: llama-version
run: |
version=$(grep '^LLAMA_VERSION' backend/cpp/llama-cpp/Makefile | head -1 | cut -d= -f2 | cut -d'?' -f1 | tr -d ' ')
echo "version=${version}" >> "$GITHUB_OUTPUT"
- name: Restore ccache
if: inputs.backend == 'llama-cpp'
id: ccache-cache
uses: actions/cache/restore@v4
with:
path: ~/Library/Caches/ccache
key: ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-${{ github.run_id }}
restore-keys: |
ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-
- name: Configure ccache
if: inputs.backend == 'llama-cpp'
run: |
mkdir -p "$HOME/Library/Caches/ccache"
ccache -M 2G
ccache -z
# llama-cpp-darwin.sh reads CMAKE_ARGS / CCACHE_DIR from env.
{
echo "CMAKE_ARGS=${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache"
echo "CCACHE_DIR=$HOME/Library/Caches/ccache"
} >> "$GITHUB_ENV"
# ---- Python wheel cache (uv + pip) ----
# Mirrors the Linux DEPS_REFRESH cadence (see .agents/ci-caching.md): the
# ISO-week segment of the cache key forces at most one cold rebuild per
# backend per week, automatically picking up newer wheels for unpinned
# deps (torch, mlx, diffusers, …). Restore-keys fall back to the most
# recent build of the same backend so off-week PRs still hit warm.
- name: Compute weekly cache bucket
if: inputs.lang == 'python'
id: weekly
run: echo "bucket=$(date -u +%Y-W%V)" >> "$GITHUB_OUTPUT"
- name: Restore Python wheel cache
if: inputs.lang == 'python'
id: pyenv-cache
uses: actions/cache/restore@v4
with:
path: |
~/Library/Caches/pip
~/Library/Caches/uv
key: pyenv-darwin-${{ inputs.backend }}-${{ steps.weekly.outputs.bucket }}-${{ hashFiles(format('backend/python/{0}/requirements*.txt', inputs.backend)) }}
restore-keys: |
pyenv-darwin-${{ inputs.backend }}-
# llama-cpp on Darwin uses a bespoke build script (scripts/build/llama-cpp-darwin.sh)
# that compiles three CMake variants from backend/cpp/llama-cpp and bundles dylibs
# via otool — it doesn't fit the build-darwin-go-backend / build-darwin-python-backend
# mold. Drive it via its dedicated `backends/llama-cpp-darwin` make target instead.
- name: Build ${{ inputs.backend }}-darwin (llama-cpp)
if: inputs.backend == 'llama-cpp'
run: |
make protogen-go
make backends/llama-cpp-darwin
- name: Build ds4 backend (Darwin Metal)
if: inputs.backend == 'ds4'
run: |
make backends/ds4-darwin
- name: Build ${{ inputs.backend }}-darwin
if: inputs.backend != 'llama-cpp' && inputs.backend != 'ds4'
run: |
make protogen-go
BACKEND=${{ inputs.backend }} BUILD_TYPE=${{ inputs.build-type }} USE_PIP=${{ inputs.use-pip }} make build-darwin-${{ inputs.lang }}-backend
- name: ccache stats
if: inputs.backend == 'llama-cpp'
run: ccache -s
- name: Save ccache
if: inputs.backend == 'llama-cpp' && github.event_name != 'pull_request'
uses: actions/cache/save@v4
with:
path: ~/Library/Caches/ccache
key: ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-${{ github.run_id }}
- name: Save Python wheel cache
if: inputs.lang == 'python' && github.event_name != 'pull_request' && steps.pyenv-cache.outputs.cache-hit != 'true'
uses: actions/cache/save@v4
with:
path: |
~/Library/Caches/pip
~/Library/Caches/uv
key: pyenv-darwin-${{ inputs.backend }}-${{ steps.weekly.outputs.bucket }}-${{ hashFiles(format('backend/python/{0}/requirements*.txt', inputs.backend)) }}
- name: Upload ${{ inputs.backend }}.tar
uses: actions/upload-artifact@v7
with:

213
.github/workflows/backend_merge.yml vendored Normal file
View File

@@ -0,0 +1,213 @@
---
name: 'merge backend manifest list (reusable)'
# Reusable workflow that joins per-arch digest artifacts (uploaded by
# backend_build.yml when called with platform-tag) into a single tagged
# multi-arch manifest list. Called once per backend by backend.yml after
# both per-arch build jobs succeed.
on:
workflow_call:
inputs:
tag-latest:
description: 'Whether the manifest list should also be tagged latest (auto/false/true)'
required: false
type: string
default: ''
tag-suffix:
description: 'Backend tag suffix (e.g. -cpu-faster-whisper). Used to compute the artifact pattern and the final tag suffix.'
required: true
type: string
secrets:
dockerUsername:
required: false
dockerPassword:
required: false
quayUsername:
required: true
quayPassword:
required: true
jobs:
merge:
runs-on: ubuntu-latest
# id-token: write is required for keyless cosign — the workflow
# exchanges the GitHub OIDC token for a short-lived Fulcio cert that
# signs each pushed manifest. Without this permission the runner
# cannot mint the token, and `cosign sign` fails with "no token".
permissions:
contents: read
id-token: write
env:
quay_username: ${{ secrets.quayUsername }}
steps:
# Sparse checkout: the merge job needs `.github/scripts/` (for the
# keepalive cleanup script) but none of the source tree.
- name: Checkout (.github/scripts only)
uses: actions/checkout@v6
with:
sparse-checkout: |
.github/scripts
sparse-checkout-cone-mode: false
# `--` separator anchors the glob so we don't over-match sibling
# backends whose tag-suffix happens to be a prefix of ours
# (e.g. -cpu-vllm vs -cpu-vllm-omni). Must stay in sync with the
# upload-artifact name in backend_build.yml.
- name: Download digests
uses: actions/download-artifact@v8
with:
pattern: digests${{ inputs.tag-suffix }}--*
merge-multiple: true
path: /tmp/digests
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@master
# cosign signs each pushed manifest list with --recursive so the
# index and every per-arch entry get an attached Sigstore bundle.
# 2.2+ is required for --new-bundle-format.
- name: Install cosign
if: github.event_name != 'pull_request'
uses: sigstore/cosign-installer@v3
with:
cosign-release: 'v2.4.1'
- name: Login to DockerHub
if: github.event_name != 'pull_request'
uses: docker/login-action@v4
with:
username: ${{ secrets.dockerUsername }}
password: ${{ secrets.dockerPassword }}
- name: Login to Quay.io
if: ${{ env.quay_username != '' }}
uses: docker/login-action@v4
with:
registry: quay.io
username: ${{ secrets.quayUsername }}
password: ${{ secrets.quayPassword }}
- name: Docker meta
id: meta
if: github.event_name != 'pull_request'
uses: docker/metadata-action@v6
with:
images: |
quay.io/go-skynet/local-ai-backends
localai/localai-backends
tags: |
type=ref,event=branch
type=semver,pattern={{raw}}
type=sha
flavor: |
latest=${{ inputs.tag-latest }}
suffix=${{ inputs.tag-suffix }},onlatest=true
# Source from ci-cache, not local-ai-backends.
#
# The build job pushes per-arch manifests to local-ai-backends with
# push-by-digest=true (no tag), then anchors a tagged copy into
# ci-cache so the manifest can be retrieved hours later when this
# merge runs. Quay's manifest GC, however, is per-repository: the
# anchor tag in ci-cache protects the manifest there, but the same
# digest in local-ai-backends has no tag in *that* repo and gets
# reaped independently. Sourcing local-ai-backends@<digest> here
# then fails with "manifest not found" — exactly the regression
# we hit on v4.2.2 (19/37 multiarch merges failed).
#
# ci-cache@<digest> resolves because we anchored it there. buildx
# imagetools create copies the manifest into local-ai-backends
# (cross-repo within the same registry, blobs already cross-mounted
# from the original push so no transfer needed) and publishes the
# manifest list with the user-facing tags. The resulting manifest
# list is fully self-contained in local-ai-backends — child digests
# only, no embedded references to ci-cache.
- name: Create manifest list and push (quay)
if: github.event_name != 'pull_request'
working-directory: /tmp/digests
run: |
set -euo pipefail
tags=$(jq -cr '
.tags
| map(select(startswith("quay.io/")))
| map("-t " + .)
| join(" ")
' <<< "$DOCKER_METADATA_OUTPUT_JSON")
if [ -z "$tags" ]; then
echo "No quay.io tags from docker/metadata-action; skipping quay merge"
exit 0
fi
# shellcheck disable=SC2086
docker buildx imagetools create $tags \
$(printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)
# Resolve the manifest-list digest (any tag points at it) so
# cosign can sign by digest. Signing by tag would leave the
# signature orphaned the next time the tag moves.
first_tag=$(jq -cr '
.tags | map(select(startswith("quay.io/"))) | .[0]
' <<< "$DOCKER_METADATA_OUTPUT_JSON")
digest=$(docker buildx imagetools inspect "$first_tag" --format '{{.Manifest.Digest}}')
# --recursive walks the list and signs every per-arch entry
# too — clients that resolve a tag to a platform-specific
# manifest before checking signatures need the per-arch
# signatures, not just the list-level one.
cosign sign --yes --recursive \
--new-bundle-format \
--registry-referrers-mode=oci-1-1 \
"quay.io/go-skynet/local-ai-backends@${digest}"
- name: Create manifest list and push (dockerhub)
if: github.event_name != 'pull_request'
working-directory: /tmp/digests
run: |
set -euo pipefail
tags=$(jq -cr '
.tags
| map(select(startswith("localai/")))
| map("-t " + .)
| join(" ")
' <<< "$DOCKER_METADATA_OUTPUT_JSON")
if [ -z "$tags" ]; then
echo "No dockerhub tags from docker/metadata-action; skipping dockerhub merge"
exit 0
fi
# shellcheck disable=SC2086
docker buildx imagetools create $tags \
$(printf 'localai/localai-backends@sha256:%s ' *)
first_tag=$(jq -cr '
.tags | map(select(startswith("localai/"))) | .[0]
' <<< "$DOCKER_METADATA_OUTPUT_JSON")
digest=$(docker buildx imagetools inspect "$first_tag" --format '{{.Manifest.Digest}}')
cosign sign --yes --recursive \
--new-bundle-format \
--registry-referrers-mode=oci-1-1 \
"localai/localai-backends@${digest}"
- name: Inspect manifest
if: github.event_name != 'pull_request'
run: |
set -euo pipefail
first_tag=$(jq -cr '.tags[0]' <<< "$DOCKER_METADATA_OUTPUT_JSON")
if [ -n "$first_tag" ] && [ "$first_tag" != "null" ]; then
docker buildx imagetools inspect "$first_tag"
fi
# See .github/scripts/cleanup-keepalive-tags.sh for why this is
# best-effort and what the failure modes are.
- name: Cleanup keepalive tags in ci-cache
if: github.event_name != 'pull_request' && success()
env:
TAG_SUFFIX: ${{ inputs.tag-suffix }}
QUAY_TOKEN: ${{ secrets.quayPassword }}
run: .github/scripts/cleanup-keepalive-tags.sh
- name: Job summary
if: github.event_name != 'pull_request'
run: |
set -euo pipefail
echo "Merged manifest tags:" >> "$GITHUB_STEP_SUMMARY"
jq -r '.tags[]' <<< "$DOCKER_METADATA_OUTPUT_JSON" | sed 's/^/- /' >> "$GITHUB_STEP_SUMMARY"
echo >> "$GITHUB_STEP_SUMMARY"
echo "Per-arch digests:" >> "$GITHUB_STEP_SUMMARY"
ls -1 /tmp/digests | sed 's/^/- sha256:/' >> "$GITHUB_STEP_SUMMARY"

View File

@@ -4,17 +4,23 @@ on:
pull_request:
concurrency:
group: ci-backends-pr-${{ github.head_ref || github.ref }}-${{ github.repository }}
cancel-in-progress: true
group: ci-backends-pr-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
cancel-in-progress: ${{ github.event_name == 'pull_request' }}
jobs:
generate-matrix:
runs-on: ubuntu-latest
outputs:
matrix: ${{ steps.set-matrix.outputs.matrix }}
matrix-darwin: ${{ steps.set-matrix.outputs.matrix-darwin }}
has-backends: ${{ steps.set-matrix.outputs.has-backends }}
has-backends-darwin: ${{ steps.set-matrix.outputs.has-backends-darwin }}
matrix-singlearch: ${{ steps.set-matrix.outputs['matrix-singlearch'] }}
matrix-multiarch: ${{ steps.set-matrix.outputs['matrix-multiarch'] }}
matrix-darwin: ${{ steps.set-matrix.outputs['matrix-darwin'] }}
merge-matrix-multiarch: ${{ steps.set-matrix.outputs['merge-matrix-multiarch'] }}
merge-matrix-singlearch: ${{ steps.set-matrix.outputs['merge-matrix-singlearch'] }}
has-backends-singlearch: ${{ steps.set-matrix.outputs['has-backends-singlearch'] }}
has-backends-multiarch: ${{ steps.set-matrix.outputs['has-backends-multiarch'] }}
has-backends-darwin: ${{ steps.set-matrix.outputs['has-backends-darwin'] }}
has-merges-multiarch: ${{ steps.set-matrix.outputs['has-merges-multiarch'] }}
has-merges-singlearch: ${{ steps.set-matrix.outputs['has-merges-singlearch'] }}
steps:
- name: Checkout repository
uses: actions/checkout@v6
@@ -27,7 +33,9 @@ jobs:
bun add js-yaml
bun add @octokit/core
# filters the matrix in backend.yml
# filters the matrix in backend.yml; splits into single-arch and
# multi-arch groups so backend-merge-jobs can `needs:` only the latter
# (matches backend.yml's structure).
- name: Filter matrix for changed backends
id: set-matrix
env:
@@ -35,10 +43,10 @@ jobs:
GITHUB_EVENT_PATH: ${{ github.event_path }}
run: bun run scripts/changed-backends.js
backend-jobs:
backend-jobs-multiarch:
needs: generate-matrix
uses: ./.github/workflows/backend_build.yml
if: needs.generate-matrix.outputs.has-backends == 'true'
if: needs.generate-matrix.outputs['has-backends-multiarch'] == 'true'
with:
tag-latest: ${{ matrix.tag-latest }}
tag-suffix: ${{ matrix.tag-suffix }}
@@ -46,19 +54,83 @@ jobs:
cuda-major-version: ${{ matrix.cuda-major-version }}
cuda-minor-version: ${{ matrix.cuda-minor-version }}
platforms: ${{ matrix.platforms }}
platform-tag: ${{ matrix.platform-tag || '' }}
runs-on: ${{ matrix.runs-on }}
builder-base-image: ${{ matrix.builder-base-image || '' }}
base-image: ${{ matrix.base-image }}
backend: ${{ matrix.backend }}
dockerfile: ${{ matrix.dockerfile }}
skip-drivers: ${{ matrix.skip-drivers }}
context: ${{ matrix.context }}
ubuntu-version: ${{ matrix.ubuntu-version }}
amdgpu-targets: ${{ matrix.amdgpu-targets || 'gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201' }}
secrets:
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
strategy:
fail-fast: true
matrix: ${{ fromJson(needs.generate-matrix.outputs.matrix) }}
max-parallel: 8
matrix: ${{ fromJson(needs.generate-matrix.outputs['matrix-multiarch']) }}
backend-jobs-singlearch:
needs: generate-matrix
uses: ./.github/workflows/backend_build.yml
if: needs.generate-matrix.outputs['has-backends-singlearch'] == 'true'
with:
tag-latest: ${{ matrix.tag-latest }}
tag-suffix: ${{ matrix.tag-suffix }}
build-type: ${{ matrix.build-type }}
cuda-major-version: ${{ matrix.cuda-major-version }}
cuda-minor-version: ${{ matrix.cuda-minor-version }}
platforms: ${{ matrix.platforms }}
platform-tag: ${{ matrix.platform-tag || '' }}
runs-on: ${{ matrix.runs-on }}
builder-base-image: ${{ matrix.builder-base-image || '' }}
base-image: ${{ matrix.base-image }}
backend: ${{ matrix.backend }}
dockerfile: ${{ matrix.dockerfile }}
skip-drivers: ${{ matrix.skip-drivers }}
context: ${{ matrix.context }}
ubuntu-version: ${{ matrix.ubuntu-version }}
amdgpu-targets: ${{ matrix.amdgpu-targets || 'gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201' }}
secrets:
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
strategy:
fail-fast: true
max-parallel: 8
matrix: ${{ fromJson(needs.generate-matrix.outputs['matrix-singlearch']) }}
backend-merge-jobs-multiarch:
needs: [generate-matrix, backend-jobs-multiarch]
# backend_merge.yml's push-side steps are all gated on
# github.event_name != 'pull_request', so on a PR the merge job would
# do nothing. Skip it entirely to avoid spinning up an empty runner.
# !cancelled() lets the merge run even when a few build legs fail —
# see the matching note in backend.yml.
if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-multiarch'] == 'true' }}
uses: ./.github/workflows/backend_merge.yml
with:
tag-latest: ${{ matrix.tag-latest }}
tag-suffix: ${{ matrix.tag-suffix }}
secrets:
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
strategy:
fail-fast: false
matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-multiarch']) }}
backend-merge-jobs-singlearch:
needs: [generate-matrix, backend-jobs-singlearch]
if: ${{ !cancelled() && github.event_name != 'pull_request' && needs.generate-matrix.outputs['has-merges-singlearch'] == 'true' }}
uses: ./.github/workflows/backend_merge.yml
with:
tag-latest: ${{ matrix.tag-latest }}
tag-suffix: ${{ matrix.tag-suffix }}
secrets:
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
strategy:
fail-fast: false
matrix: ${{ fromJson(needs.generate-matrix.outputs['merge-matrix-singlearch']) }}
backend-jobs-darwin:
needs: generate-matrix
uses: ./.github/workflows/backend_build_darwin.yml
@@ -66,7 +138,7 @@ jobs:
with:
backend: ${{ matrix.backend }}
build-type: ${{ matrix.build-type }}
go-version: "1.24.x"
go-version: "1.25.x"
tag-suffix: ${{ matrix.tag-suffix }}
lang: ${{ matrix.lang || 'python' }}
use-pip: ${{ matrix.backend == 'diffusers' }}

161
.github/workflows/base-images.yml vendored Normal file
View File

@@ -0,0 +1,161 @@
---
name: 'build base-grpc images'
# Builds + pushes pre-compiled builder base images that downstream
# llama-cpp / ik-llama-cpp / turboquant variant Dockerfiles will FROM
# (PR 2). Each base contains apt deps + protoc + cmake + gRPC at
# /opt/grpc + (conditionally) CUDA / ROCm / Vulkan toolchains.
#
# Triggers:
# - schedule (Saturdays 05:00 UTC) - picks up Ubuntu/CUDA/ROCm
# security updates and re-runs ahead of the backend.yml weekly
# cron (Sundays 06:00 UTC).
# - workflow_dispatch - manual one-off rebuild.
# - push to master that touches Dockerfile.base-grpc-builder or
# this workflow itself - keeps bases in sync with their inputs.
#
# Bootstrap (one-time after this PR merges):
# gh workflow run base-images.yml --ref master
# Wait ~30 min for all 9 matrix variants to push to
# quay.io/go-skynet/ci-cache:base-grpc-* before merging PR 2.
on:
schedule:
- cron: '0 5 * * 6'
workflow_dispatch:
push:
branches: [master]
paths:
- 'backend/Dockerfile.base-grpc-builder'
- '.github/workflows/base-images.yml'
# The install logic and apt-mirror helper are bind-mounted into
# Dockerfile.base-grpc-builder at build time — changes to either
# affect the produced base images and must trigger a rebuild.
- '.docker/install-base-deps.sh'
- '.docker/apt-mirror.sh'
concurrency:
group: ci-base-images-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
cancel-in-progress: ${{ github.event_name == 'pull_request' }}
jobs:
build:
if: github.repository == 'mudler/LocalAI'
runs-on: ${{ matrix.runs-on }}
strategy:
fail-fast: false
matrix:
include:
- tag: 'base-grpc-amd64'
runs-on: 'ubuntu-latest'
base-image: 'ubuntu:24.04'
build-type: ''
cuda-major-version: ''
cuda-minor-version: ''
ubuntu-version: '2404'
- tag: 'base-grpc-arm64'
runs-on: 'ubuntu-24.04-arm'
base-image: 'ubuntu:24.04'
build-type: ''
cuda-major-version: ''
cuda-minor-version: ''
ubuntu-version: '2404'
- tag: 'base-grpc-cuda-12-amd64'
runs-on: 'ubuntu-latest'
base-image: 'ubuntu:24.04'
build-type: 'cublas'
cuda-major-version: '12'
cuda-minor-version: '8'
ubuntu-version: '2404'
- tag: 'base-grpc-cuda-13-amd64'
runs-on: 'ubuntu-latest'
base-image: 'ubuntu:22.04'
build-type: 'cublas'
cuda-major-version: '13'
cuda-minor-version: '0'
ubuntu-version: '2204'
- tag: 'base-grpc-cuda-13-arm64'
runs-on: 'ubuntu-24.04-arm'
base-image: 'ubuntu:24.04'
build-type: 'cublas'
cuda-major-version: '13'
cuda-minor-version: '0'
ubuntu-version: '2404'
- tag: 'base-grpc-rocm-amd64'
runs-on: 'ubuntu-latest'
base-image: 'rocm/dev-ubuntu-24.04:7.2.1'
build-type: 'hipblas'
cuda-major-version: ''
cuda-minor-version: ''
ubuntu-version: '2404'
- tag: 'base-grpc-vulkan-amd64'
runs-on: 'ubuntu-latest'
base-image: 'ubuntu:24.04'
build-type: 'vulkan'
cuda-major-version: ''
cuda-minor-version: ''
ubuntu-version: '2404'
- tag: 'base-grpc-vulkan-arm64'
runs-on: 'ubuntu-24.04-arm'
base-image: 'ubuntu:24.04'
build-type: 'vulkan'
cuda-major-version: ''
cuda-minor-version: ''
ubuntu-version: '2404'
- tag: 'base-grpc-intel-amd64'
runs-on: 'ubuntu-latest'
base-image: 'intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04'
build-type: 'sycl'
cuda-major-version: ''
cuda-minor-version: ''
ubuntu-version: '2404'
# Legacy JetPack r36.4.0 base for older Jetson devices (CUDA 12).
# Distinct from base-grpc-cuda-13-arm64 (Ubuntu 24.04 + CUDA 13 sbsa)
# which targets newer Jetsons. Some matrix entries
# (-nvidia-l4t-arm64-llama-cpp / -turboquant) still build against
# the JetPack image, so we need a matching base.
- tag: 'base-grpc-l4t-cuda-12-arm64'
runs-on: 'ubuntu-24.04-arm'
base-image: 'nvcr.io/nvidia/l4t-jetpack:r36.4.0'
build-type: 'l4t'
cuda-major-version: '12'
cuda-minor-version: '0'
ubuntu-version: '2204'
# JetPack r36.4.0 already ships CUDA preinstalled at /usr/local/cuda;
# apt-installing cuda-nvcc-12-0 from the public repos fails because
# those packages aren't published for the JetPack apt feed. Match
# the original l4t matrix entry which set skip-drivers: 'true'.
skip-drivers: 'true'
steps:
- uses: actions/checkout@v6
with:
submodules: false
- name: Free disk space
uses: ./.github/actions/free-disk-space
- name: Set up build disk
uses: ./.github/actions/setup-build-disk
- uses: docker/setup-qemu-action@master
with:
platforms: all
- uses: docker/setup-buildx-action@master
- uses: docker/login-action@v4
with:
registry: quay.io
username: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
password: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
- uses: docker/build-push-action@v7
with:
context: .
file: ./backend/Dockerfile.base-grpc-builder
build-args: |
BASE_IMAGE=${{ matrix.base-image }}
BUILD_TYPE=${{ matrix.build-type }}
CUDA_MAJOR_VERSION=${{ matrix.cuda-major-version }}
CUDA_MINOR_VERSION=${{ matrix.cuda-minor-version }}
UBUNTU_VERSION=${{ matrix.ubuntu-version }}
SKIP_DRIVERS=${{ matrix.skip-drivers || 'false' }}
cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache-${{ matrix.tag }}
cache-to: type=registry,ref=quay.io/go-skynet/ci-cache:cache-${{ matrix.tag }},mode=max,ignore-error=true
provenance: false
tags: quay.io/go-skynet/ci-cache:${{ matrix.tag }}
push: true

View File

@@ -50,6 +50,8 @@ jobs:
uses: actions/checkout@v6
with:
fetch-depth: 0
- name: Configure apt mirror on runner
uses: ./.github/actions/configure-apt-mirror
- name: Set up Go
uses: actions/setup-go@v5
with:

View File

@@ -22,6 +22,10 @@ jobs:
variable: "TURBOQUANT_VERSION"
branch: "feature/turboquant-kv-cache"
file: "backend/cpp/turboquant/Makefile"
- repository: "antirez/ds4"
variable: "DS4_VERSION"
branch: "main"
file: "backend/cpp/ds4/Makefile"
- repository: "ggml-org/whisper.cpp"
variable: "WHISPER_CPP_VERSION"
branch: "master"
@@ -50,6 +54,10 @@ jobs:
variable: "QWEN3TTS_CPP_VERSION"
branch: "main"
file: "backend/go/qwen3-tts-cpp/Makefile"
- repository: "localai-org/vibevoice.cpp"
variable: "VIBEVOICE_CPP_VERSION"
branch: "master"
file: "backend/go/vibevoice-cpp/Makefile"
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6
@@ -80,5 +88,37 @@ jobs:
body: ${{ steps.bump.outputs.message }}
signoff: true
bump-vllm-wheel:
# vLLM's cu130 wheel comes from a per-tag index URL (no /latest/ alias),
# so the cublas13 requirements file pins both a URL segment and a version
# constraint. bump_deps.sh handles git-sha-in-Makefile only — this job
# rewrites both values atomically when a new vLLM stable tag ships.
if: github.repository == 'mudler/LocalAI'
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6
- name: Bump vLLM cu130 wheel pin 🔧
id: bump
run: |
bash .github/bump_vllm_wheel.sh vllm-project/vllm backend/python/vllm/requirements-cublas13-after.txt VLLM_VERSION
{
echo 'message<<EOF'
cat "VLLM_VERSION_message.txt"
echo EOF
} >> "$GITHUB_OUTPUT"
{
echo 'commit<<EOF'
cat "VLLM_VERSION_commit.txt"
echo EOF
} >> "$GITHUB_OUTPUT"
rm -rfv VLLM_VERSION_message.txt VLLM_VERSION_commit.txt
- name: Create Pull Request
uses: peter-evans/create-pull-request@v8
with:
token: ${{ secrets.UPDATE_BOT_TOKEN }}
push-to-fork: ci-forks/LocalAI
commit-message: ':arrow_up: Update vllm-project/vllm cu130 wheel'
title: 'chore: :arrow_up: Update vllm-project/vllm cu130 wheel to `${{ steps.bump.outputs.commit }}`'
branch: "update/VLLM_VERSION"
body: ${{ steps.bump.outputs.message }}
signoff: true

View File

@@ -8,15 +8,9 @@ jobs:
if: github.repository == 'mudler/LocalAI'
runs-on: ubuntu-latest
steps:
- name: Force Install GIT latest
run: |
sudo apt-get update \
&& sudo apt-get install -y software-properties-common \
&& sudo apt-get update \
&& sudo add-apt-repository -y ppa:git-core/ppa \
&& sudo apt-get update \
&& sudo apt-get install -y git
- uses: actions/checkout@v6
- name: Configure apt mirror on runner
uses: ./.github/actions/configure-apt-mirror
- name: Install dependencies
run: |
sudo apt-get update

View File

@@ -2,7 +2,7 @@ name: Gallery Agent
on:
schedule:
- cron: '0 */3 * * *' # Run every 4 hours
- cron: '0 */12 * * *' # Run every 4 hours
workflow_dispatch:
inputs:
search_term:

View File

@@ -1,96 +0,0 @@
name: 'generate and publish GRPC docker caches'
on:
workflow_dispatch:
schedule:
# daily at midnight
- cron: '0 0 * * *'
concurrency:
group: grpc-cache-${{ github.head_ref || github.ref }}-${{ github.repository }}
cancel-in-progress: true
jobs:
generate_caches:
if: github.repository == 'mudler/LocalAI'
strategy:
matrix:
include:
- grpc-base-image: ubuntu:24.04
runs-on: 'ubuntu-latest'
platforms: 'linux/amd64,linux/arm64'
runs-on: ${{matrix.runs-on}}
steps:
- name: Release space from worker
if: matrix.runs-on == 'ubuntu-latest'
run: |
echo "Listing top largest packages"
pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
head -n 30 <<< "${pkgs}"
echo
df -h
echo
sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
sudo apt-get remove --auto-remove android-sdk-platform-tools || true
sudo apt-get purge --auto-remove android-sdk-platform-tools || true
sudo rm -rf /usr/local/lib/android
sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
sudo rm -rf /usr/share/dotnet
sudo apt-get remove -y '^mono-.*' || true
sudo apt-get remove -y '^ghc-.*' || true
sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
sudo apt-get remove -y 'php.*' || true
sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
sudo apt-get remove -y '^google-.*' || true
sudo apt-get remove -y azure-cli || true
sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
sudo apt-get remove -y '^gfortran-.*' || true
sudo apt-get remove -y microsoft-edge-stable || true
sudo apt-get remove -y firefox || true
sudo apt-get remove -y powershell || true
sudo apt-get remove -y r-base-core || true
sudo apt-get autoremove -y
sudo apt-get clean
echo
echo "Listing top largest packages"
pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
head -n 30 <<< "${pkgs}"
echo
sudo rm -rfv build || true
sudo rm -rf /usr/share/dotnet || true
sudo rm -rf /opt/ghc || true
sudo rm -rf "/usr/local/share/boost" || true
sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
df -h
- name: Set up QEMU
uses: docker/setup-qemu-action@master
with:
platforms: all
- name: Set up Docker Buildx
id: buildx
uses: docker/setup-buildx-action@master
- name: Checkout
uses: actions/checkout@v6
- name: Cache GRPC
uses: docker/build-push-action@v7
with:
builder: ${{ steps.buildx.outputs.name }}
# The build-args MUST be an EXACT match between the image cache and other workflow steps that want to use that cache.
# This means that even the MAKEFLAGS have to be an EXACT match.
# If the build-args are not an EXACT match, it will result in a cache miss, which will require GRPC to be built from scratch.
build-args: |
GRPC_BASE_IMAGE=${{ matrix.grpc-base-image }}
GRPC_MAKEFLAGS=--jobs=4 --output-sync=target
GRPC_VERSION=v1.65.0
context: .
file: ./Dockerfile
cache-to: type=gha,ignore-error=true
cache-from: type=gha
target: grpc
platforms: ${{ matrix.platforms }}
push: false

View File

@@ -7,8 +7,8 @@ on:
- master
concurrency:
group: intel-cache-${{ github.head_ref || github.ref }}-${{ github.repository }}
cancel-in-progress: true
group: intel-cache-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
cancel-in-progress: ${{ github.event_name == 'pull_request' }}
jobs:
generate_caches:
@@ -16,7 +16,7 @@ jobs:
strategy:
matrix:
include:
- base-image: intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04
- base-image: intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04
runs-on: 'arc-runner-set'
platforms: 'linux/amd64'
runs-on: ${{matrix.runs-on}}

View File

@@ -5,8 +5,8 @@
pull_request:
concurrency:
group: ci-${{ github.head_ref || github.ref }}-${{ github.repository }}
cancel-in-progress: true
group: ci-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
cancel-in-progress: ${{ github.event_name == 'pull_request' }}
jobs:
image-build:
@@ -18,9 +18,9 @@
cuda-major-version: ${{ matrix.cuda-major-version }}
cuda-minor-version: ${{ matrix.cuda-minor-version }}
platforms: ${{ matrix.platforms }}
platform-tag: ${{ matrix.platform-tag || '' }}
runs-on: ${{ matrix.runs-on }}
base-image: ${{ matrix.base-image }}
grpc-base-image: ${{ matrix.grpc-base-image }}
makeflags: ${{ matrix.makeflags }}
ubuntu-version: ${{ matrix.ubuntu-version }}
secrets:
@@ -60,27 +60,35 @@
tag-latest: 'false'
tag-suffix: '-hipblas'
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
grpc-base-image: "ubuntu:24.04"
runs-on: 'ubuntu-latest'
makeflags: "--jobs=3 --output-sync=target"
ubuntu-version: '2404'
- build-type: 'sycl'
platforms: 'linux/amd64'
tag-latest: 'false'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
grpc-base-image: "ubuntu:24.04"
base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"
tag-suffix: 'sycl'
runs-on: 'ubuntu-latest'
makeflags: "--jobs=3 --output-sync=target"
ubuntu-version: '2404'
- build-type: 'vulkan'
platforms: 'linux/amd64,linux/arm64'
platforms: 'linux/amd64'
platform-tag: 'amd64'
tag-latest: 'false'
tag-suffix: '-vulkan-core'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
makeflags: "--jobs=4 --output-sync=target"
ubuntu-version: '2404'
- build-type: 'vulkan'
platforms: 'linux/arm64'
platform-tag: 'arm64'
tag-latest: 'false'
tag-suffix: '-vulkan-core'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
makeflags: "--jobs=4 --output-sync=target"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"

View File

@@ -9,8 +9,8 @@
- '*'
concurrency:
group: ci-${{ github.head_ref || github.ref }}-${{ github.repository }}
cancel-in-progress: true
group: ci-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
cancel-in-progress: ${{ github.event_name == 'pull_request' }}
jobs:
hipblas-jobs:
@@ -25,7 +25,6 @@
platforms: ${{ matrix.platforms }}
runs-on: ${{ matrix.runs-on }}
base-image: ${{ matrix.base-image }}
grpc-base-image: ${{ matrix.grpc-base-image }}
makeflags: ${{ matrix.makeflags }}
ubuntu-version: ${{ matrix.ubuntu-version }}
ubuntu-codename: ${{ matrix.ubuntu-codename }}
@@ -42,12 +41,11 @@
tag-latest: 'auto'
tag-suffix: '-gpu-hipblas'
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
grpc-base-image: "ubuntu:24.04"
runs-on: 'ubuntu-latest'
makeflags: "--jobs=3 --output-sync=target"
ubuntu-version: '2404'
ubuntu-codename: 'noble'
core-image-build:
if: github.repository == 'mudler/LocalAI'
uses: ./.github/workflows/image_build.yml
@@ -58,9 +56,9 @@
cuda-major-version: ${{ matrix.cuda-major-version }}
cuda-minor-version: ${{ matrix.cuda-minor-version }}
platforms: ${{ matrix.platforms }}
platform-tag: ${{ matrix.platform-tag || '' }}
runs-on: ${{ matrix.runs-on }}
base-image: ${{ matrix.base-image }}
grpc-base-image: ${{ matrix.grpc-base-image }}
makeflags: ${{ matrix.makeflags }}
skip-drivers: ${{ matrix.skip-drivers }}
ubuntu-version: ${{ matrix.ubuntu-version }}
@@ -75,7 +73,8 @@
matrix:
include:
- build-type: ''
platforms: 'linux/amd64,linux/arm64'
platforms: 'linux/amd64'
platform-tag: 'amd64'
tag-latest: 'auto'
tag-suffix: ''
base-image: "ubuntu:24.04"
@@ -84,6 +83,17 @@
skip-drivers: 'false'
ubuntu-version: '2404'
ubuntu-codename: 'noble'
- build-type: ''
platforms: 'linux/arm64'
platform-tag: 'arm64'
tag-latest: 'auto'
tag-suffix: ''
base-image: "ubuntu:24.04"
runs-on: 'ubuntu-24.04-arm'
makeflags: "--jobs=4 --output-sync=target"
skip-drivers: 'false'
ubuntu-version: '2404'
ubuntu-codename: 'noble'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
@@ -109,7 +119,8 @@
ubuntu-version: '2404'
ubuntu-codename: 'noble'
- build-type: 'vulkan'
platforms: 'linux/amd64,linux/arm64'
platforms: 'linux/amd64'
platform-tag: 'amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan'
runs-on: 'ubuntu-latest'
@@ -118,17 +129,141 @@
makeflags: "--jobs=4 --output-sync=target"
ubuntu-version: '2404'
ubuntu-codename: 'noble'
- build-type: 'vulkan'
platforms: 'linux/arm64'
platform-tag: 'arm64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan'
runs-on: 'ubuntu-24.04-arm'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
makeflags: "--jobs=4 --output-sync=target"
ubuntu-version: '2404'
ubuntu-codename: 'noble'
- build-type: 'intel'
platforms: 'linux/amd64'
tag-latest: 'auto'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
grpc-base-image: "ubuntu:24.04"
base-image: "intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04"
tag-suffix: '-gpu-intel'
runs-on: 'ubuntu-latest'
makeflags: "--jobs=3 --output-sync=target"
ubuntu-version: '2404'
ubuntu-codename: 'noble'
core-image-merge:
# !cancelled(): without it, GHA's default `needs:` cascade skips the
# merge whenever any matrix cell of the parent build fails or is
# cancelled. Same fix as backend.yml's merge jobs — we still want to
# publish the manifest list for tag-suffixes whose legs all succeeded.
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: core-image-build
uses: ./.github/workflows/image_merge.yml
with:
tag-latest: 'auto'
tag-suffix: ''
secrets:
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
gpu-vulkan-image-merge:
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: core-image-build
uses: ./.github/workflows/image_merge.yml
with:
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan'
secrets:
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
# Single-arch server-image merges. Same conceptual fix as the backend
# singletons in PR #9781: image_build.yml pushes by canonical digest
# only, so without a downstream merge step there's no tag for consumers
# (no :latest-gpu-nvidia-cuda-12, no :v<X>-gpu-nvidia-cuda-12, etc.).
# Each merge job needs only its parent build matrix and is filtered by
# tag-suffix in image_merge.yml's artifact-download pattern.
gpu-nvidia-cuda-12-image-merge:
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: core-image-build
uses: ./.github/workflows/image_merge.yml
with:
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12'
secrets:
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
gpu-nvidia-cuda-13-image-merge:
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: core-image-build
uses: ./.github/workflows/image_merge.yml
with:
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13'
secrets:
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
gpu-intel-image-merge:
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: core-image-build
uses: ./.github/workflows/image_merge.yml
with:
tag-latest: 'auto'
tag-suffix: '-gpu-intel'
secrets:
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
gpu-hipblas-image-merge:
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: hipblas-jobs
uses: ./.github/workflows/image_merge.yml
with:
tag-latest: 'auto'
tag-suffix: '-gpu-hipblas'
secrets:
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
nvidia-l4t-arm64-image-merge:
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: gh-runner
uses: ./.github/workflows/image_merge.yml
with:
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-arm64'
secrets:
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
nvidia-l4t-arm64-cuda-13-image-merge:
if: ${{ !cancelled() && github.repository == 'mudler/LocalAI' }}
needs: gh-runner
uses: ./.github/workflows/image_merge.yml
with:
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-arm64-cuda-13'
secrets:
dockerUsername: ${{ secrets.DOCKERHUB_USERNAME }}
dockerPassword: ${{ secrets.DOCKERHUB_PASSWORD }}
quayUsername: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
quayPassword: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
gh-runner:
if: github.repository == 'mudler/LocalAI'
uses: ./.github/workflows/image_build.yml
@@ -141,7 +276,6 @@
platforms: ${{ matrix.platforms }}
runs-on: ${{ matrix.runs-on }}
base-image: ${{ matrix.base-image }}
grpc-base-image: ${{ matrix.grpc-base-image }}
makeflags: ${{ matrix.makeflags }}
skip-drivers: ${{ matrix.skip-drivers }}
ubuntu-version: ${{ matrix.ubuntu-version }}

View File

@@ -8,11 +8,6 @@ on:
description: 'Base image'
required: true
type: string
grpc-base-image:
description: 'GRPC Base image, must be a compatible image with base-image'
required: false
default: ''
type: string
build-type:
description: 'Build type'
default: ''
@@ -29,6 +24,15 @@ on:
description: 'Platforms'
default: ''
type: string
platform-tag:
description: |
Short tag identifying the platform leg, e.g. "amd64" or "arm64".
Used to scope the per-arch registry cache and the digest artifact name.
Optional during the migration; will be flipped to required: true once
every caller passes an explicit value.
required: false
default: ''
type: string
tag-latest:
description: 'Tag latest'
default: ''
@@ -75,73 +79,20 @@ jobs:
runs-on: ${{ inputs.runs-on }}
steps:
- name: Free Disk Space (Ubuntu)
if: inputs.runs-on == 'ubuntu-latest'
uses: jlumbroso/free-disk-space@main
with:
# this might remove tools that are actually needed,
# if set to "true" but frees about 6 GB
tool-cache: true
# all of these default to true, but feel free to set to
# "false" if necessary for your workflow
android: true
dotnet: true
haskell: true
large-packages: true
docker-images: true
swap-storage: true
- name: Force Install GIT latest
run: |
sudo apt-get update \
&& sudo apt-get install -y software-properties-common \
&& sudo apt-get update \
&& sudo add-apt-repository -y ppa:git-core/ppa \
&& sudo apt-get update \
&& sudo apt-get install -y git
- name: Checkout
uses: actions/checkout@v6
- name: Release space from worker
if: inputs.runs-on == 'ubuntu-latest'
run: |
echo "Listing top largest packages"
pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
head -n 30 <<< "${pkgs}"
echo
df -h
echo
sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
sudo apt-get remove --auto-remove android-sdk-platform-tools snapd || true
sudo apt-get purge --auto-remove android-sdk-platform-tools snapd || true
sudo rm -rf /usr/local/lib/android
sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
sudo rm -rf /usr/share/dotnet
sudo apt-get remove -y '^mono-.*' || true
sudo apt-get remove -y '^ghc-.*' || true
sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
sudo apt-get remove -y 'php.*' || true
sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
sudo apt-get remove -y '^google-.*' || true
sudo apt-get remove -y azure-cli || true
sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
sudo apt-get remove -y '^gfortran-.*' || true
sudo apt-get remove -y microsoft-edge-stable || true
sudo apt-get remove -y firefox || true
sudo apt-get remove -y powershell || true
sudo apt-get remove -y r-base-core || true
sudo apt-get autoremove -y
sudo apt-get clean
echo
echo "Listing top largest packages"
pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
head -n 30 <<< "${pkgs}"
echo
sudo rm -rfv build || true
sudo rm -rf /usr/share/dotnet || true
sudo rm -rf /opt/ghc || true
sudo rm -rf "/usr/local/share/boost" || true
sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
df -h
- name: Configure apt mirror on runner
id: apt_mirror
uses: ./.github/actions/configure-apt-mirror
- name: Free disk space
uses: ./.github/actions/free-disk-space
with:
mode: ${{ inputs.runs-on == 'ubuntu-latest' && 'hosted' || 'skip' }}
- name: Set up build disk
uses: ./.github/actions/setup-build-disk
- name: Docker meta
id: meta
@@ -196,59 +147,89 @@ jobs:
username: ${{ secrets.quayUsername }}
password: ${{ secrets.quayPassword }}
- name: Build and push
- name: Build and push by digest
id: build
uses: docker/build-push-action@v7
if: github.event_name != 'pull_request'
with:
builder: ${{ steps.buildx.outputs.name }}
# The build-args MUST be an EXACT match between the image cache and other workflow steps that want to use that cache.
# This means that even the MAKEFLAGS have to be an EXACT match.
# If the build-args are not an EXACT match, it will result in a cache miss, which will require GRPC to be built from scratch.
# This is why some build args like GRPC_VERSION and MAKEFLAGS are hardcoded
build-args: |
BUILD_TYPE=${{ inputs.build-type }}
CUDA_MAJOR_VERSION=${{ inputs.cuda-major-version }}
CUDA_MINOR_VERSION=${{ inputs.cuda-minor-version }}
BASE_IMAGE=${{ inputs.base-image }}
GRPC_BASE_IMAGE=${{ inputs.grpc-base-image || inputs.base-image }}
GRPC_MAKEFLAGS=--jobs=4 --output-sync=target
GRPC_VERSION=v1.65.0
MAKEFLAGS=${{ inputs.makeflags }}
SKIP_DRIVERS=${{ inputs.skip-drivers }}
UBUNTU_VERSION=${{ inputs.ubuntu-version }}
UBUNTU_CODENAME=${{ inputs.ubuntu-codename }}
APT_MIRROR=${{ steps.apt_mirror.outputs.effective-mirror }}
APT_PORTS_MIRROR=${{ steps.apt_mirror.outputs.effective-ports-mirror }}
context: .
file: ./Dockerfile
cache-from: type=gha
cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache-localai${{ inputs.tag-suffix }}-${{ inputs.platform-tag }}
cache-to: type=registry,ref=quay.io/go-skynet/ci-cache:cache-localai${{ inputs.tag-suffix }}-${{ inputs.platform-tag }},mode=max,ignore-error=true
platforms: ${{ inputs.platforms }}
push: ${{ github.event_name != 'pull_request' }}
tags: ${{ steps.meta.outputs.tags }}
outputs: |
type=image,name=quay.io/go-skynet/local-ai,push-by-digest=true,name-canonical=true,push=true
type=image,name=localai/localai,push-by-digest=true,name-canonical=true,push=true
# See backend_build.yml for the rationale — provenance=mode=max
# diverges the manifest-list digest per registry, breaking the
# downstream imagetools create lookup.
provenance: false
labels: ${{ steps.meta.outputs.labels }}
- name: Export digest
if: github.event_name != 'pull_request'
run: |
mkdir -p /tmp/digests
digest="${{ steps.build.outputs.digest }}"
touch "/tmp/digests/${digest#sha256:}"
# See .github/scripts/anchor-digest-in-cache.sh for why this is needed
# and how it interacts with image_merge.yml's cleanup step. Mirrors the
# same anchor in backend_build.yml — quay's per-repo manifest GC reaps
# untagged manifests in local-ai before the merge runs.
- name: Anchor digest in ci-cache so quay GC won't reap before merge
if: github.event_name != 'pull_request'
env:
TAG_SUFFIX: ${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}
PLATFORM_TAG: ${{ inputs.platform-tag || 'single' }}
DIGEST: ${{ steps.build.outputs.digest }}
SOURCE_IMAGE: quay.io/go-skynet/local-ai
run: .github/scripts/anchor-digest-in-cache.sh
- name: Upload digest artifact
if: github.event_name != 'pull_request'
uses: actions/upload-artifact@v7
with:
# `--` separator + 'single' placeholder for empty platform-tag —
# same pattern as backend_build.yml. Prevents prefix collisions
# in the merge-side glob (e.g. -nvidia-l4t-arm64 is a prefix of
# -nvidia-l4t-arm64-cuda-13).
name: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}--${{ inputs.platform-tag || 'single' }}
path: /tmp/digests/*
if-no-files-found: error
retention-days: 1
### Start testing image
- name: Build and push
uses: docker/build-push-action@v7
if: github.event_name == 'pull_request'
with:
builder: ${{ steps.buildx.outputs.name }}
# The build-args MUST be an EXACT match between the image cache and other workflow steps that want to use that cache.
# This means that even the MAKEFLAGS have to be an EXACT match.
# If the build-args are not an EXACT match, it will result in a cache miss, which will require GRPC to be built from scratch.
# This is why some build args like GRPC_VERSION and MAKEFLAGS are hardcoded
build-args: |
BUILD_TYPE=${{ inputs.build-type }}
CUDA_MAJOR_VERSION=${{ inputs.cuda-major-version }}
CUDA_MINOR_VERSION=${{ inputs.cuda-minor-version }}
BASE_IMAGE=${{ inputs.base-image }}
GRPC_BASE_IMAGE=${{ inputs.grpc-base-image || inputs.base-image }}
GRPC_MAKEFLAGS=--jobs=4 --output-sync=target
GRPC_VERSION=v1.65.0
MAKEFLAGS=${{ inputs.makeflags }}
SKIP_DRIVERS=${{ inputs.skip-drivers }}
UBUNTU_VERSION=${{ inputs.ubuntu-version }}
UBUNTU_CODENAME=${{ inputs.ubuntu-codename }}
APT_MIRROR=${{ steps.apt_mirror.outputs.effective-mirror }}
APT_PORTS_MIRROR=${{ steps.apt_mirror.outputs.effective-ports-mirror }}
context: .
file: ./Dockerfile
cache-from: type=gha
cache-from: type=registry,ref=quay.io/go-skynet/ci-cache:cache-localai${{ inputs.tag-suffix }}-${{ inputs.platform-tag }}
platforms: ${{ inputs.platforms }}
#push: true
tags: ${{ steps.meta_pull_request.outputs.tags }}

145
.github/workflows/image_merge.yml vendored Normal file
View File

@@ -0,0 +1,145 @@
---
name: 'merge LocalAI image manifest list (reusable)'
# Reusable workflow that joins per-arch digest artifacts (uploaded by
# image_build.yml when called with platform-tag) into a single tagged
# multi-arch manifest list.
on:
workflow_call:
inputs:
tag-latest:
description: 'Whether the manifest list should also be tagged latest (auto/false/true)'
required: false
type: string
default: ''
tag-suffix:
description: 'Image tag suffix (empty for core image). Used in artifact pattern with a -core placeholder for empty.'
required: true
type: string
secrets:
dockerUsername:
required: false
dockerPassword:
required: false
quayUsername:
required: true
quayPassword:
required: true
jobs:
merge:
runs-on: ubuntu-latest
env:
quay_username: ${{ secrets.quayUsername }}
steps:
# Sparse checkout: needed for .github/scripts/ (the keepalive cleanup
# script). Skips the rest of the source tree.
- name: Checkout (.github/scripts only)
uses: actions/checkout@v6
with:
sparse-checkout: |
.github/scripts
sparse-checkout-cone-mode: false
- name: Download digests
uses: actions/download-artifact@v8
with:
# `--` separator anchors the glob so we don't over-match sibling
# tag-suffixes (e.g. -nvidia-l4t-arm64 vs -nvidia-l4t-arm64-cuda-13).
# Must stay in sync with image_build.yml's upload-artifact name.
pattern: digests-localai${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}--*
merge-multiple: true
path: /tmp/digests
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@master
- name: Login to DockerHub
if: github.event_name != 'pull_request'
uses: docker/login-action@v4
with:
username: ${{ secrets.dockerUsername }}
password: ${{ secrets.dockerPassword }}
- name: Login to Quay.io
uses: docker/login-action@v4
with:
registry: quay.io
username: ${{ secrets.quayUsername }}
password: ${{ secrets.quayPassword }}
- name: Docker meta
id: meta
uses: docker/metadata-action@v6
with:
images: |
quay.io/go-skynet/local-ai
localai/localai
tags: |
type=ref,event=branch
type=semver,pattern={{raw}}
type=sha
flavor: |
latest=${{ inputs.tag-latest }}
suffix=${{ inputs.tag-suffix }},onlatest=true
# Source from ci-cache, not local-ai. See backend_merge.yml for the
# detailed rationale — quay's manifest GC is per-repository, so the
# untagged digest in local-ai gets reaped while the same content lives
# tagged under ci-cache (anchored by image_build.yml). buildx imagetools
# create copies the manifest into local-ai (blobs already cross-mounted)
# and publishes the manifest list with user-facing tags. End state in
# local-ai is self-contained; no embedded reference to ci-cache.
- name: Create manifest list and push (quay)
working-directory: /tmp/digests
run: |
set -euo pipefail
tags=$(jq -cr '.tags | map(select(startswith("quay.io/"))) | map("-t " + .) | join(" ")' <<< "$DOCKER_METADATA_OUTPUT_JSON")
if [ -z "$tags" ]; then
echo "No quay.io tags from docker/metadata-action; skipping quay merge"
else
# shellcheck disable=SC2086
docker buildx imagetools create $tags \
$(printf 'quay.io/go-skynet/ci-cache@sha256:%s ' *)
fi
- name: Create manifest list and push (dockerhub)
if: github.event_name != 'pull_request'
working-directory: /tmp/digests
run: |
set -euo pipefail
tags=$(jq -cr '.tags | map(select(startswith("localai/"))) | map("-t " + .) | join(" ")' <<< "$DOCKER_METADATA_OUTPUT_JSON")
if [ -z "$tags" ]; then
echo "No dockerhub tags from docker/metadata-action; skipping dockerhub merge"
else
# shellcheck disable=SC2086
docker buildx imagetools create $tags \
$(printf 'localai/localai@sha256:%s ' *)
fi
- name: Inspect manifest
run: |
set -euo pipefail
first_tag=$(jq -cr '.tags[0]' <<< "$DOCKER_METADATA_OUTPUT_JSON")
if [ -n "$first_tag" ] && [ "$first_tag" != "null" ]; then
docker buildx imagetools inspect "$first_tag"
fi
# See .github/scripts/cleanup-keepalive-tags.sh for the best-effort
# semantics — fails soft when the registry credential isn't OAuth-scoped.
- name: Cleanup keepalive tags in ci-cache
if: github.event_name != 'pull_request' && success()
env:
TAG_SUFFIX: ${{ inputs.tag-suffix == '' && '-core' || inputs.tag-suffix }}
QUAY_TOKEN: ${{ secrets.quayPassword }}
run: .github/scripts/cleanup-keepalive-tags.sh
- name: Job summary
run: |
set -euo pipefail
echo "Merged manifest tags:" >> "$GITHUB_STEP_SUMMARY"
jq -r '.tags[]' <<< "$DOCKER_METADATA_OUTPUT_JSON" | sed 's/^/- /' >> "$GITHUB_STEP_SUMMARY"
echo >> "$GITHUB_STEP_SUMMARY"
echo "Per-arch digests:" >> "$GITHUB_STEP_SUMMARY"
ls -1 /tmp/digests | sed 's/^/- sha256:/' >> "$GITHUB_STEP_SUMMARY"

48
.github/workflows/lint.yml vendored Normal file
View File

@@ -0,0 +1,48 @@
---
name: 'lint'
on:
pull_request:
paths-ignore:
- 'docs/**'
- 'examples/**'
- 'README.md'
- '**/*.md'
push:
branches:
- master
concurrency:
group: ci-lint-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
cancel-in-progress: ${{ github.event_name == 'pull_request' }}
jobs:
golangci-lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6
with:
# Full history so golangci-lint's new-from-merge-base can reach
# origin/master and compute the diff against it.
fetch-depth: 0
- uses: actions/setup-go@v5
with:
go-version: '1.26.x'
cache: false
- name: install golangci-lint
run: |
curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh \
| sh -s -- -b "$(go env GOPATH)/bin" v2.11.4
- name: generate grpc proto sources
# pkg/grpc/proto/*.go is generated, not checked in. Several packages
# import it, so without this step typecheck fails project-wide.
run: make protogen-go
- name: stub react-ui dist for go:embed
# core/http/app.go has //go:embed react-ui/dist/*; the glob needs at
# least one non-hidden entry to satisfy typecheck. We don't run
# `make react-ui` here because lint doesn't need the real bundle.
run: |
mkdir -p core/http/react-ui/dist
touch core/http/react-ui/dist/index.html
- name: lint
run: make lint

View File

@@ -49,6 +49,8 @@ jobs:
uses: actions/checkout@v6
with:
fetch-depth: 0
- name: Configure apt mirror on runner
uses: ./.github/actions/configure-apt-mirror
- name: Set up Go
uses: actions/setup-go@v5
with:

View File

@@ -10,8 +10,8 @@ on:
- '*'
concurrency:
group: ci-tests-extra-${{ github.head_ref || github.ref }}-${{ github.repository }}
cancel-in-progress: true
group: ci-tests-extra-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
cancel-in-progress: ${{ github.event_name == 'pull_request' }}
jobs:
detect-changes:
@@ -28,6 +28,7 @@ jobs:
qwen-asr: ${{ steps.detect.outputs.qwen-asr }}
nemo: ${{ steps.detect.outputs.nemo }}
voxcpm: ${{ steps.detect.outputs.voxcpm }}
liquid-audio: ${{ steps.detect.outputs.liquid-audio }}
llama-cpp-quantization: ${{ steps.detect.outputs.llama-cpp-quantization }}
llama-cpp: ${{ steps.detect.outputs.llama-cpp }}
ik-llama-cpp: ${{ steps.detect.outputs.ik-llama-cpp }}
@@ -36,8 +37,14 @@ jobs:
sglang: ${{ steps.detect.outputs.sglang }}
acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
qwen3-tts-cpp: ${{ steps.detect.outputs.qwen3-tts-cpp }}
vibevoice-cpp: ${{ steps.detect.outputs.vibevoice-cpp }}
localvqe: ${{ steps.detect.outputs.localvqe }}
voxtral: ${{ steps.detect.outputs.voxtral }}
kokoros: ${{ steps.detect.outputs.kokoros }}
insightface: ${{ steps.detect.outputs.insightface }}
speaker-recognition: ${{ steps.detect.outputs.speaker-recognition }}
sherpa-onnx: ${{ steps.detect.outputs.sherpa-onnx }}
whisper: ${{ steps.detect.outputs.whisper }}
steps:
- name: Checkout repository
uses: actions/checkout@v6
@@ -441,6 +448,32 @@ jobs:
run: |
make --jobs=5 --output-sync=target -C backend/python/voxcpm
make --jobs=5 --output-sync=target -C backend/python/voxcpm test
# liquid-audio: LFM2.5-Audio any-to-any backend. The CI smoke test
# exercises Health() and LoadModel(mode:finetune) — fine-tune mode
# short-circuits before pulling weights (backend.py:192), so no
# HuggingFace download or GPU is needed. The full-inference path is
# gated on LIQUID_AUDIO_MODEL_ID, which we don't set here.
tests-liquid-audio:
needs: detect-changes
if: needs.detect-changes.outputs.liquid-audio == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install -y build-essential ffmpeg
sudo apt-get install -y ca-certificates cmake curl patch python3-pip
# Install UV
curl -LsSf https://astral.sh/uv/install.sh | sh
pip install --user --no-cache-dir grpcio-tools==1.64.1
- name: Test liquid-audio
run: |
make --jobs=5 --output-sync=target -C backend/python/liquid-audio
make --jobs=5 --output-sync=target -C backend/python/liquid-audio test
tests-llama-cpp-quantization:
needs: detect-changes
if: needs.detect-changes.outputs.llama-cpp-quantization == 'true' || needs.detect-changes.outputs.run-all == 'true'
@@ -504,6 +537,120 @@ jobs:
- name: Build llama-cpp backend image and run audio transcription gRPC e2e tests
run: |
make test-extra-backend-llama-cpp-transcription
# PR-acceptance smoke gate: always runs on every PR (no detect-changes gate, no
# paths filter). Pulls the pre-built master CPU llama-cpp image from quay
# instead of building from source, so the cost is a docker pull (~30s) plus the
# short Qwen3-0.6B model download. Exercises the full gRPC surface — health,
# load, predict, stream — plus the logprobs/logit_bias specs that moved out of
# core/http/app_test.go. Anything heavier or per-backend is gated to the
# detect-changes path-filter above.
tests-llama-cpp-smoke:
runs-on: ubuntu-latest
timeout-minutes: 20
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
- name: Pull pre-built llama-cpp backend image
run: docker pull quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp
- name: Run e2e-backends smoke
env:
BACKEND_IMAGE: quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp
BACKEND_TEST_CAPS: health,load,predict,stream,logprobs,logit_bias
run: |
make test-extra-backend
# Realtime e2e with sherpa-onnx driving VAD + STT + TTS against a mocked LLM.
# Builds the sherpa-onnx Docker image, extracts the rootfs so the e2e suite
# can discover the backend binary + shared libs, downloads the three model
# bundles (silero-vad, omnilingual-asr, vits-ljs) and drives the realtime
# websocket spec end-to-end.
tests-sherpa-onnx-realtime:
needs: detect-changes
if: needs.detect-changes.outputs.sherpa-onnx == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
- name: Setup Node.js
uses: actions/setup-node@v6
with:
node-version: '22'
- name: Build sherpa-onnx backend image and run realtime e2e tests
run: |
make test-extra-e2e-realtime-sherpa
# Streaming ASR via the sherpa-onnx online recognizer (zipformer
# transducer). Exercises both AudioTranscription (buffered) and
# AudioTranscriptionStream (real-time deltas) on the e2e-backends
# harness.
tests-sherpa-onnx-grpc-transcription:
needs: detect-changes
if: needs.detect-changes.outputs.sherpa-onnx == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
- name: Build sherpa-onnx backend image and run streaming ASR gRPC e2e tests
run: |
make test-extra-backend-sherpa-onnx-transcription
# End-to-end transcription via the e2e-backends gRPC harness against
# the whisper.cpp backend. Drives AudioTranscription (offline) and
# AudioTranscriptionStream (real, segment-callback-driven deltas) on
# ggml-base.en + the JFK 11s clip.
tests-whisper-grpc-transcription:
needs: detect-changes
if: needs.detect-changes.outputs.whisper == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
- name: Build whisper backend image and run transcription gRPC e2e tests
run: |
make test-extra-backend-whisper-transcription
# VITS TTS via the sherpa-onnx backend. Drives both TTS (file write) and
# TTSStream (PCM chunks) on the e2e-backends harness.
tests-sherpa-onnx-grpc-tts:
needs: detect-changes
if: needs.detect-changes.outputs.sherpa-onnx == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
- name: Build sherpa-onnx backend image and run TTS gRPC e2e tests
run: |
make test-extra-backend-sherpa-onnx-tts
tests-ik-llama-cpp-grpc:
needs: detect-changes
if: needs.detect-changes.outputs.ik-llama-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
@@ -696,6 +843,117 @@ jobs:
- name: Test qwen3-tts-cpp
run: |
make --jobs=5 --output-sync=target -C backend/go/qwen3-tts-cpp test
# Per-backend smoke for vibevoice-cpp: builds the .so + Go binary and
# runs `make -C backend/go/vibevoice-cpp test`. test.sh auto-downloads
# the published mudler/vibevoice.cpp-models bundle (TTS Q8_0 + ASR Q4_K
# + tokenizer + voice) and runs the closed-loop TTS → ASR Go test.
tests-vibevoice-cpp:
needs: detect-changes
if: needs.detect-changes.outputs.vibevoice-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install -y build-essential cmake curl libopenblas-dev ffmpeg
- name: Setup Go
uses: actions/setup-go@v5
- name: Display Go version
run: go version
- name: Proto Dependencies
run: |
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
PATH="$PATH:$HOME/go/bin" make protogen-go
- name: Build vibevoice-cpp
run: |
make --jobs=5 --output-sync=target -C backend/go/vibevoice-cpp
- name: Test vibevoice-cpp
run: |
make --jobs=5 --output-sync=target -C backend/go/vibevoice-cpp test
# End-to-end TTS via the e2e-backends gRPC harness. Builds the
# vibevoice-cpp Docker image and drives Backend/TTS against it with a
# real LocalAI gRPC client.
tests-vibevoice-cpp-grpc-tts:
needs: detect-changes
if: needs.detect-changes.outputs.vibevoice-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
- name: Build vibevoice-cpp backend image and run TTS gRPC e2e tests
run: |
make test-extra-backend-vibevoice-cpp-tts
# End-to-end transcription via the e2e-backends gRPC harness. The
# vibevoice ASR is a 7B-param model (Q4_K weights ~10 GB on disk)
# and the JFK 30 s decode is too heavy for a free 4-core
# ubuntu-latest pool runner - two CI attempts got SIGTERM'd during
# LoadModel, before the test could even progress. Use the
# self-hosted 'bigger-runner' label (same one the GPU image builds
# in backend.yml use) and the documented dotnet/ghc/android cache
# purge to clear ~10-20 GB of headroom for the model + Docker
# image + working dir.
tests-vibevoice-cpp-grpc-transcription:
needs: detect-changes
if: needs.detect-changes.outputs.vibevoice-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: bigger-runner
timeout-minutes: 150
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
make build-essential curl unzip ca-certificates git tar
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
- name: Free disk space
run: |
sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /opt/hostedtoolcache/CodeQL || true
df -h
- name: Build vibevoice-cpp backend image and run ASR gRPC e2e tests
run: |
make test-extra-backend-vibevoice-cpp-transcription
# End-to-end audio transform via the e2e-backends gRPC harness. The
# LocalVQE GGUF is small (~5 MB) and the model is real-time on CPU, so
# the default ubuntu-latest pool is plenty.
tests-localvqe-grpc-transform:
needs: detect-changes
if: needs.detect-changes.outputs.localvqe == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 60
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
- name: Build localvqe backend image and run audio_transform gRPC e2e tests
run: |
make test-extra-backend-localvqe-transform
tests-voxtral:
needs: detect-changes
if: needs.detect-changes.outputs.voxtral == 'true' || needs.detect-changes.outputs.run-all == 'true'
@@ -751,3 +1009,55 @@ jobs:
- name: Test kokoros
run: |
make -C backend/rust/kokoros test
tests-insightface-grpc:
needs: detect-changes
if: needs.detect-changes.outputs.insightface == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
make build-essential curl unzip ca-certificates git tar
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.26.0'
- name: Free disk space
run: |
sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /opt/hostedtoolcache/CodeQL || true
df -h
- name: Build insightface backend image and run both model configurations
run: |
make test-extra-backend-insightface-all
tests-speaker-recognition-grpc:
needs: detect-changes
if: needs.detect-changes.outputs.speaker-recognition == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
sudo apt-get update
sudo apt-get install -y --no-install-recommends \
make build-essential curl ca-certificates git tar
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.26.0'
- name: Free disk space
run: |
sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /opt/hostedtoolcache/CodeQL || true
df -h
- name: Build speaker-recognition backend image and run the ECAPA-TDNN configuration
run: |
make test-extra-backend-speaker-recognition-all

View File

@@ -9,12 +9,9 @@ on:
tags:
- '*'
env:
GRPC_VERSION: v1.65.0
concurrency:
group: ci-tests-${{ github.head_ref || github.ref }}-${{ github.repository }}
cancel-in-progress: true
group: ci-tests-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
cancel-in-progress: ${{ github.event_name == 'pull_request' }}
jobs:
tests-linux:
@@ -23,56 +20,12 @@ jobs:
matrix:
go-version: ['1.26.x']
steps:
- name: Free Disk Space (Ubuntu)
uses: jlumbroso/free-disk-space@main
with:
# this might remove tools that are actually needed,
# if set to "true" but frees about 6 GB
tool-cache: true
# all of these default to true, but feel free to set to
# "false" if necessary for your workflow
android: true
dotnet: true
haskell: true
large-packages: true
docker-images: true
swap-storage: true
- name: Release space from worker
run: |
echo "Listing top largest packages"
pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
head -n 30 <<< "${pkgs}"
echo
df -h
echo
sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
sudo apt-get remove --auto-remove android-sdk-platform-tools || true
sudo apt-get purge --auto-remove android-sdk-platform-tools || true
sudo rm -rf /usr/local/lib/android
sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
sudo rm -rf /usr/share/dotnet
sudo apt-get remove -y '^mono-.*' || true
sudo apt-get remove -y '^ghc-.*' || true
sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
sudo apt-get remove -y 'php.*' || true
sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
sudo apt-get remove -y '^google-.*' || true
sudo apt-get remove -y azure-cli || true
sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
sudo apt-get remove -y '^gfortran-.*' || true
sudo apt-get autoremove -y
sudo apt-get clean
echo
echo "Listing top largest packages"
pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
head -n 30 <<< "${pkgs}"
echo
sudo rm -rfv build || true
df -h
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Free disk space
uses: ./.github/actions/free-disk-space
- name: Setup Go ${{ matrix.go-version }}
uses: actions/setup-go@v5
with:
@@ -100,73 +53,9 @@ jobs:
node-version: '22'
- name: Build React UI
run: make react-ui
- name: Build backends
run: |
make backends/transformers
mkdir external && mv backends/transformers external/transformers
make backends/llama-cpp backends/local-store backends/silero-vad backends/piper backends/whisper backends/stablediffusion-ggml
- name: Test
run: |
TRANSFORMER_BACKEND=$PWD/external/transformers/run.sh PATH="$PATH:/root/go/bin" GO_TAGS="tts" make --jobs 5 --output-sync=target test
- name: Setup tmate session if tests fail
if: ${{ failure() }}
uses: mxschmitt/action-tmate@v3.23
with:
detached: true
connect-timeout-seconds: 180
limit-access-to-actor: true
tests-e2e-container:
runs-on: ubuntu-latest
steps:
- name: Release space from worker
run: |
echo "Listing top largest packages"
pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
head -n 30 <<< "${pkgs}"
echo
df -h
echo
sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
sudo apt-get remove --auto-remove android-sdk-platform-tools || true
sudo apt-get purge --auto-remove android-sdk-platform-tools || true
sudo rm -rf /usr/local/lib/android
sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
sudo rm -rf /usr/share/dotnet
sudo apt-get remove -y '^mono-.*' || true
sudo apt-get remove -y '^ghc-.*' || true
sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
sudo apt-get remove -y 'php.*' || true
sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
sudo apt-get remove -y '^google-.*' || true
sudo apt-get remove -y azure-cli || true
sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
sudo apt-get remove -y '^gfortran-.*' || true
sudo apt-get autoremove -y
sudo apt-get clean
echo
echo "Listing top largest packages"
pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
head -n 30 <<< "${pkgs}"
echo
sudo rm -rfv build || true
df -h
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
# Install protoc
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
PATH="$PATH:$HOME/go/bin" make protogen-go
- name: Test
run: |
PATH="$PATH:$HOME/go/bin" make backends/local-store backends/silero-vad backends/llama-cpp backends/whisper backends/piper backends/stablediffusion-ggml docker-build-e2e e2e-aio
PATH="$PATH:/root/go/bin" make --jobs 5 --output-sync=target test
- name: Setup tmate session if tests fail
if: ${{ failure() }}
uses: mxschmitt/action-tmate@v3.23
@@ -195,7 +84,7 @@ jobs:
run: go version
- name: Dependencies
run: |
brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm opus
brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm opus ffmpeg
pip install --user --no-cache-dir grpcio-tools grpcio
- name: Setup Node.js
uses: actions/setup-node@v6
@@ -203,10 +92,6 @@ jobs:
node-version: '22'
- name: Build React UI
run: make react-ui
- name: Build llama-cpp-darwin
run: |
make protogen-go
make backends/llama-cpp-darwin
- name: Test
run: |
export C_INCLUDE_PATH=/usr/local/include

86
.github/workflows/tests-aio.yml vendored Normal file
View File

@@ -0,0 +1,86 @@
---
name: 'tests-aio'
# Runs the all-in-one (AIO) Docker image with real backends + real models.
# Heavy: builds llama-cpp/whisper/piper/silero-vad/stablediffusion-ggml/local-store
# and exercises end-to-end inference inside the container. Moved out of test.yml
# (which used to run on every PR) so PR CI no longer pays this cost.
#
# Triggers:
# - schedule (nightly @ 04:00 UTC) — catches packaging/image regressions within 24h
# - workflow_dispatch — manual run on-demand
# - push to master/tags — sanity check after merge / before release
on:
schedule:
- cron: '0 4 * * *'
workflow_dispatch:
push:
branches:
- master
tags:
- '*'
concurrency:
group: ci-tests-aio-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
cancel-in-progress: ${{ github.event_name == 'pull_request' }}
jobs:
tests-aio:
runs-on: ubuntu-latest
steps:
- name: Release space from worker
run: |
echo "Listing top largest packages"
pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
head -n 30 <<< "${pkgs}"
echo
df -h
echo
sudo apt-get remove -y '^llvm-.*|^libllvm.*' || true
sudo apt-get remove --auto-remove android-sdk-platform-tools || true
sudo apt-get purge --auto-remove android-sdk-platform-tools || true
sudo rm -rf /usr/local/lib/android
sudo apt-get remove -y '^dotnet-.*|^aspnetcore-.*' || true
sudo rm -rf /usr/share/dotnet
sudo apt-get remove -y '^mono-.*' || true
sudo apt-get remove -y '^ghc-.*' || true
sudo apt-get remove -y '.*jdk.*|.*jre.*' || true
sudo apt-get remove -y 'php.*' || true
sudo apt-get remove -y hhvm powershell firefox monodoc-manual msbuild || true
sudo apt-get remove -y '^google-.*' || true
sudo apt-get remove -y azure-cli || true
sudo apt-get remove -y '^mongo.*-.*|^postgresql-.*|^mysql-.*|^mssql-.*' || true
sudo apt-get remove -y '^gfortran-.*' || true
sudo apt-get autoremove -y
sudo apt-get clean
echo
echo "Listing top largest packages"
pkgs=$(dpkg-query -Wf '${Installed-Size}\t${Package}\t${Status}\n' | awk '$NF == "installed"{print $1 "\t" $2}' | sort -nr)
head -n 30 <<< "${pkgs}"
echo
sudo rm -rfv build || true
df -h
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Dependencies
run: |
# Install protoc
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
PATH="$PATH:$HOME/go/bin" make protogen-go
- name: Test
run: |
PATH="$PATH:$HOME/go/bin" make backends/local-store backends/silero-vad backends/llama-cpp backends/whisper backends/piper backends/stablediffusion-ggml docker-build-e2e e2e-aio
- name: Setup tmate session if tests fail
if: ${{ failure() }}
uses: mxschmitt/action-tmate@v3.23
with:
detached: true
connect-timeout-seconds: 180
limit-access-to-actor: true

View File

@@ -10,8 +10,8 @@ on:
- '*'
concurrency:
group: ci-tests-e2e-backend-${{ github.head_ref || github.ref }}-${{ github.repository }}
cancel-in-progress: true
group: ci-tests-e2e-backend-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
cancel-in-progress: ${{ github.event_name == 'pull_request' }}
jobs:
tests-e2e-backend:
@@ -24,6 +24,8 @@ jobs:
uses: actions/checkout@v6
with:
submodules: true
- name: Configure apt mirror on runner
uses: ./.github/actions/configure-apt-mirror
- name: Setup Go ${{ matrix.go-version }}
uses: actions/setup-go@v5
with:

View File

@@ -12,8 +12,8 @@ on:
- master
concurrency:
group: ci-tests-ui-e2e-${{ github.head_ref || github.ref }}-${{ github.repository }}
cancel-in-progress: true
group: ci-tests-ui-e2e-${{ github.event.pull_request.number || github.sha }}-${{ github.repository }}
cancel-in-progress: ${{ github.event_name == 'pull_request' }}
jobs:
tests-ui-e2e:
@@ -26,6 +26,8 @@ jobs:
uses: actions/checkout@v6
with:
submodules: true
- name: Configure apt mirror on runner
uses: ./.github/actions/configure-apt-mirror
- name: Setup Go ${{ matrix.go-version }}
uses: actions/setup-go@v5
with:

View File

@@ -11,6 +11,8 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6
- name: Configure apt mirror on runner
uses: ./.github/actions/configure-apt-mirror
- uses: actions/setup-go@v5
with:
go-version: 'stable'

97
.golangci.yml Normal file
View File

@@ -0,0 +1,97 @@
version: "2"
# Only issues introduced relative to master are reported. Pre-existing issues
# in the codebase do not fail the lint job; they're treated as a baseline that
# can be cleaned up incrementally. New code (added lines on a branch) is held
# to the full linter set. Locally, `make lint-all` overrides this and reports
# every issue.
issues:
# origin/master because in shallow CI checkouts only the remote-tracking
# branch exists; a bare 'master' ref isn't reachable locally.
new-from-merge-base: origin/master
linters:
default: standard
# staticcheck is noisy on this codebase (mostly QF style suggestions like
# "could use tagged switch" or "unnecessary fmt.Sprintf"). Re-enable
# selectively if a high-signal subset is identified.
disable:
- staticcheck
enable:
- forbidigo
settings:
forbidigo:
forbid:
- pattern: '^t\.Errorf$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use Expect(...).To(...) instead of t.Errorf. See .agents/coding-style.md.'
- pattern: '^t\.Error$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use Expect(...).To(...) instead of t.Error. See .agents/coding-style.md.'
- pattern: '^t\.Fatalf$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use Expect(...).To(Succeed()) / Fail(...) instead of t.Fatalf. See .agents/coding-style.md.'
- pattern: '^t\.Fatal$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use Expect(...).To(Succeed()) / Fail(...) instead of t.Fatal. See .agents/coding-style.md.'
- pattern: '^t\.Run$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use Describe/Context/It instead of t.Run. See .agents/coding-style.md.'
- pattern: '^t\.Skip$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use Skip(...) instead of t.Skip. See .agents/coding-style.md.'
- pattern: '^t\.Skipf$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use Skip(...) instead of t.Skipf. See .agents/coding-style.md.'
- pattern: '^t\.SkipNow$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use Skip(...) instead of t.SkipNow. See .agents/coding-style.md.'
- pattern: '^t\.Logf$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use GinkgoWriter / fmt.Fprintf(GinkgoWriter, ...) instead of t.Logf. See .agents/coding-style.md.'
- pattern: '^t\.Log$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use GinkgoWriter / fmt.Fprintln(GinkgoWriter, ...) instead of t.Log. See .agents/coding-style.md.'
- pattern: '^t\.Fail$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.Fail. See .agents/coding-style.md.'
- pattern: '^t\.FailNow$'
msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.FailNow. See .agents/coding-style.md.'
# In-process config should flow through ApplicationConfig / kong-bound
# CLI flags, not via os.Getenv. The CLI layer is the legitimate
# env→struct boundary (kong's `env:"..."` tag); anything deeper that
# reads env directly leaks process state into business logic and
# makes flags impossible to test or override per-request. Backend
# subprocesses, the system/capabilities probe, and a few places that
# read non-LocalAI env vars (HOME, PATH, AUTH_TOKEN passed by parent)
# are exempt — see linters.exclusions.rules below.
- pattern: '^os\.(Getenv|LookupEnv|Environ)$'
msg: 'Plumb config through ApplicationConfig (or the relevant CLI struct) instead of reading env directly. CLI entry points (core/cli/) bind env vars via kong''s `env:` tag — that is the only sanctioned env→struct boundary. See .agents/coding-style.md.'
exclusions:
paths:
# Upstream whisper.cpp source tree fetched by the whisper backend Makefile.
- 'backend/go/whisper/sources'
- 'docs/'
rules:
# CLI entry points: kong's `env:"..."` tag is the legitimate env→struct
# boundary, and a handful of subcommands legitimately propagate values
# to spawned subprocesses (LLAMACPP_GRPC_SERVERS, MLX hostfile, ...).
- path: ^core/cli/
text: 'os\.(Getenv|LookupEnv|Environ)'
linters: [forbidigo]
# Backend subprocesses are independent binaries with their own env
# surface; they're not "in-process config" of the LocalAI server.
- path: ^backend/
text: 'os\.(Getenv|LookupEnv|Environ)'
linters: [forbidigo]
# System capability probe reads HOME, PATH-style vars to discover
# GPUs, default paths, etc. — not LocalAI config.
- path: ^pkg/system/
text: 'os\.(Getenv|LookupEnv|Environ)'
linters: [forbidigo]
# gRPC server reads AUTH_TOKEN passed in by the parent process at spawn
# time; model.Loader sets/inherits env to communicate with subprocesses.
- path: ^pkg/grpc/
text: 'os\.(Getenv|LookupEnv|Environ)'
linters: [forbidigo]
- path: ^pkg/model/
text: 'os\.(Getenv|LookupEnv|Environ)'
linters: [forbidigo]
# Top-level main binaries (local-ai, launcher) are entry points.
- path: ^cmd/
text: 'os\.(Getenv|LookupEnv|Environ)'
linters: [forbidigo]
# Tests legitimately read $HOME, $TMPDIR, and gating env vars
# (LOCALAI_COSIGN_LIVE, etc.) to skip live-network specs.
- path: _test\.go$
text: 'os\.(Getenv|LookupEnv|Environ)'
linters: [forbidigo]

View File

@@ -19,14 +19,19 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
|------|-------------|
| [.agents/ai-coding-assistants.md](.agents/ai-coding-assistants.md) | Policy for AI-assisted contributions — licensing, DCO, attribution |
| [.agents/building-and-testing.md](.agents/building-and-testing.md) | Building the project, running tests, Docker builds for specific platforms |
| [.agents/adding-backends.md](.agents/adding-backends.md) | Adding a new backend (Python, Go, or C++) — full step-by-step checklist |
| [.agents/ci-caching.md](.agents/ci-caching.md) | CI build cache layout (registry-backed BuildKit cache on quay.io/go-skynet/ci-cache, per-arch keys), `DEPS_REFRESH` weekly cache-buster for unpinned Python deps, prebuilt `base-grpc-*` images for llama.cpp variants, per-arch native + manifest-merge pattern, `setup-build-disk` `/mnt` relocation, path filter on master push, manual eviction |
| [.agents/adding-backends.md](.agents/adding-backends.md) | Adding a new backend (Python, Go, or C++) — full step-by-step checklist, including importer integration (the `/import-model` dropdown is server-driven from `GET /backends/known`) |
| [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions |
| [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
| [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
| [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling |
| [.agents/ds4-backend.md](.agents/ds4-backend.md) | Working on the ds4 backend - DSML state machine, thinking modes, KV cache, Metal+CUDA matrix |
| [.agents/testing-mcp-apps.md](.agents/testing-mcp-apps.md) | Testing MCP Apps (interactive tool UIs) in the React UI |
| [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) | Adding API endpoints, auth middleware, feature permissions, user access control |
| [.agents/debugging-backends.md](.agents/debugging-backends.md) | Debugging runtime backend failures, dependency conflicts, rebuilding backends |
| [.agents/adding-gallery-models.md](.agents/adding-gallery-models.md) | Adding GGUF models from HuggingFace to the model gallery |
| [.agents/localai-assistant-mcp.md](.agents/localai-assistant-mcp.md) | LocalAI Assistant chat modality — adding admin tools to the in-process MCP server, editing skill prompts, keeping REST + MCP + skills in sync |
| [.agents/backend-signing.md](.agents/backend-signing.md) | Backend OCI image signing (keyless cosign + sigstore-go) — producer-side CI setup, consumer-side gallery `verification:` block, strict mode (`LOCALAI_REQUIRE_BACKEND_INTEGRITY`), revocation via `not_before` |
## Quick Reference
@@ -34,5 +39,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
- **Go style**: Prefer `any` over `interface{}`
- **Comments**: Explain *why*, not *what*
- **Docs**: Update `docs/content/` when adding features or changing config
- **New API endpoints**: LocalAI advertises its capability surface in several independent places — swagger `@Tags`, `/api/instructions` registry, auth `RouteFeatureRegistry`, React UI `capabilities.js`, docs. Read [.agents/api-endpoints-and-auth.md](.agents/api-endpoints-and-auth.md) and follow its checklist — missing any surface means clients, admins, and the UI won't know the endpoint exists.
- **Admin endpoints → MCP tool**: every admin endpoint that an admin would manage conversationally (install/list/edit/toggle/upgrade) MUST also be exposed as an MCP tool in `pkg/mcp/localaitools/`. The LocalAI Assistant chat modality and the standalone `local-ai mcp-server` consume that package; drift between REST and MCP is a real risk. Read [.agents/localai-assistant-mcp.md](.agents/localai-assistant-mcp.md) — the `TestToolHTTPRouteMappingComplete` test fails until you wire the new tool and update the route map.
- **Build**: Inspect `Makefile` and `.github/workflows/` — ask the user before running long builds
- **UI**: The active UI is the React app in `core/http/react-ui/`. The older Alpine.js/HTML UI in `core/http/static/` is pending deprecation — all new UI work goes in the React UI

View File

@@ -1,13 +1,20 @@
ARG BASE_IMAGE=ubuntu:24.04
ARG GRPC_BASE_IMAGE=${BASE_IMAGE}
ARG INTEL_BASE_IMAGE=${BASE_IMAGE}
ARG UBUNTU_CODENAME=noble
# Optional alternate Ubuntu apt mirror(s). Empty = use upstream.
# See .docker/apt-mirror.sh for accepted values.
ARG APT_MIRROR=""
ARG APT_PORTS_MIRROR=""
FROM ${BASE_IMAGE} AS requirements
ARG APT_MIRROR
ARG APT_PORTS_MIRROR
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && \
RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
apt-get update && \
apt-get install -y --no-install-recommends \
ca-certificates curl wget espeak-ng libgomp1 \
ffmpeg libopenblas0 libopenblas-dev libopus0 sox && \
@@ -149,6 +156,7 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
hipblas-dev \
hipblaslt-dev \
rocblas-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
@@ -240,10 +248,14 @@ WORKDIR /build
# This is a temporary workaround until Intel fixes their repository
FROM ${INTEL_BASE_IMAGE} AS intel
ARG UBUNTU_CODENAME=noble
ARG APT_MIRROR
ARG APT_PORTS_MIRROR
RUN wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \
gpg --yes --dearmor --output /usr/share/keyrings/intel-graphics.gpg
RUN echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu ${UBUNTU_CODENAME}/lts/2350 unified" > /etc/apt/sources.list.d/intel-graphics.list
RUN apt-get update && \
RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
apt-get update && \
apt-get install -y --no-install-recommends \
intel-oneapi-runtime-libs && \
apt-get clean && \
@@ -293,7 +305,7 @@ EOT
###################################
# Build React UI
FROM node:25-slim AS react-ui-builder
FROM node:26-slim AS react-ui-builder
WORKDIR /app
COPY core/http/react-ui/package*.json ./
RUN npm install

450
Makefile
View File

@@ -1,5 +1,5 @@
# Disable parallel execution for backend builds
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio
GOCMD=go
GOTEST=$(GOCMD) test
@@ -10,6 +10,13 @@ LAUNCHER_BINARY_NAME=local-ai-launcher
UBUNTU_VERSION?=2404
UBUNTU_CODENAME?=noble
# Optional Ubuntu apt mirror overrides forwarded to docker builds.
# Empty = use upstream archive.ubuntu.com / security.ubuntu.com / ports.ubuntu.com.
# Set e.g. APT_MIRROR=http://azure.archive.ubuntu.com to route apt traffic
# during outages of the default Ubuntu pool.
APT_MIRROR?=
APT_PORTS_MIRROR?=
GORELEASER?=
export BUILD_TYPE?=
@@ -65,7 +72,7 @@ endif
TEST_PATHS?=./api/... ./pkg/... ./core/...
.PHONY: all test build vendor
.PHONY: all test build vendor lint lint-all
all: help
@@ -85,6 +92,7 @@ clean: ## Remove build related file
clean-tests:
rm -rf test-models
rm -rf test-dir
rm -f tests/e2e/mock-backend/mock-backend
## Install Go tools
install-go-tools:
@@ -143,32 +151,56 @@ osx-signed: build
run: ## run local-ai
CGO_LDFLAGS="$(CGO_LDFLAGS)" $(GOCMD) run ./
test-models/testmodel.ggml:
mkdir -p test-models
mkdir -p test-dir
wget -q https://huggingface.co/mradermacher/gpt2-alpaca-gpt4-GGUF/resolve/main/gpt2-alpaca-gpt4.Q4_K_M.gguf -O test-models/testmodel.ggml
wget -q https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin -O test-models/whisper-en
wget -q https://cdn.openai.com/whisper/draft-20220913a/micro-machines.wav -O test-dir/audio.wav
cp tests/models_fixtures/* test-models
prepare-test: protogen-go
cp tests/models_fixtures/* test-models
prepare-test: protogen-go build-mock-backend
########################################################
## Tests
########################################################
## Test targets
test: test-models/testmodel.ggml protogen-go
## After the test-suite reorg (see plans/test-reorg) the default `make test`
## no longer downloads multi-GB GGUF/whisper fixtures or builds llama-cpp /
## transformers / piper / whisper / stablediffusion-ggml. core/http/app_test.go
## now drives the mock-backend binary built by build-mock-backend; real-backend
## inference moved into tests/e2e-backends/ (per-backend, path-filtered) and
## tests/e2e-aio/ (nightly).
test: prepare-test
@echo 'Running tests'
export GO_TAGS="debug"
$(MAKE) prepare-test
OPUS_SHIM_LIBRARY=$(abspath ./pkg/opus/shim/libopusshim.so) \
HUGGINGFACE_GRPC=$(abspath ./)/backend/python/transformers/run.sh TEST_DIR=$(abspath ./)/test-dir/ FIXTURES=$(abspath ./)/tests/fixtures CONFIG_FILE=$(abspath ./)/test-models/config.yaml MODELS_PATH=$(abspath ./)/test-models BACKENDS_PATH=$(abspath ./)/backends \
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="!llama-gguf" --flake-attempts $(TEST_FLAKES) --fail-fast -v -r $(TEST_PATHS)
$(MAKE) test-llama-gguf
$(MAKE) test-tts
$(MAKE) test-stablediffusion
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) --fail-fast -v -r $(TEST_PATHS)
########################################################
## Lint
########################################################
## Runs golangci-lint with config from .golangci.yml. Includes the standard
## linter set plus forbidigo, which enforces the Ginkgo/Gomega-only test
## convention documented in .agents/coding-style.md.
##
## LINT_EXCLUDE_DIRS_RE matches directories whose Go packages can't typecheck
## without C/C++ headers we don't install in the lint runner (cgo wrappers
## around llama.cpp, piper/spdlog, silero-vad/onnxruntime, and Fyne/OpenGL for
## the launcher). Their compile-time correctness is enforced by their own
## build pipelines. Keep this as a deny list — `go list ./...` discovers
## everything else automatically, so new packages are scanned by default.
LINT_EXCLUDE_DIRS_RE=/(backend/go/(piper|silero-vad|llm)|cmd/launcher)(/|$$)
lint:
@command -v golangci-lint >/dev/null 2>&1 || { \
echo 'golangci-lint not installed. Install: go install github.com/golangci/golangci-lint/v2/cmd/golangci-lint@latest'; \
exit 1; \
}
golangci-lint run $$(go list -e -f '{{.Dir}}' ./... | grep -vE '$(LINT_EXCLUDE_DIRS_RE)')
## Like `lint` but reports every issue, including the pre-existing baseline
## that `lint` ignores via .golangci.yml's new-from-merge-base. Use this to
## see what's available to clean up.
lint-all:
@command -v golangci-lint >/dev/null 2>&1 || { \
echo 'golangci-lint not installed. Install: go install github.com/golangci/golangci-lint/v2/cmd/golangci-lint@latest'; \
exit 1; \
}
golangci-lint run --new=false --new-from-merge-base= --new-from-rev= $$(go list -e -f '{{.Dir}}' ./... | grep -vE '$(LINT_EXCLUDE_DIRS_RE)')
########################################################
## E2E AIO tests (uses standard image with pre-configured models)
@@ -184,6 +216,8 @@ docker-build-e2e:
--build-arg CUDA_MINOR_VERSION=$(CUDA_MINOR_VERSION) \
--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
--build-arg APT_MIRROR=$(APT_MIRROR) \
--build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
--build-arg GO_TAGS="$(GO_TAGS)" \
-t local-ai:tests -f Dockerfile .
@@ -198,6 +232,20 @@ run-e2e-aio: protogen-go
@echo 'Running e2e AIO tests'
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e-aio
# vLLM multi-node DP smoke (CPU). Builds local-ai:tests and the
# cpu-vllm backend from the current working tree, then drives a
# head + headless follower via testcontainers-go and asserts a chat
# completion. BuildKit caches both images, so re-runs only rebuild
# what changed. The test lives under tests/e2e/distributed and is
# selected by the VLLMMultinode label so it doesn't run alongside
# the other distributed-suite tests by default.
test-e2e-vllm-multinode: docker-build-e2e extract-backend-vllm protogen-go
@echo 'Running e2e vLLM multi-node DP test'
LOCALAI_IMAGE=local-ai \
LOCALAI_IMAGE_TAG=tests \
LOCALAI_VLLM_BACKEND_DIR=$(abspath ./local-backends/vllm) \
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter='VLLMMultinode' -v -r ./tests/e2e/distributed
########################################################
## E2E tests
########################################################
@@ -211,6 +259,8 @@ prepare-e2e:
--build-arg CUDA_MINOR_VERSION=$(CUDA_MINOR_VERSION) \
--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
--build-arg APT_MIRROR=$(APT_MIRROR) \
--build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
--build-arg GO_TAGS="$(GO_TAGS)" \
--build-arg MAKEFLAGS="$(DOCKER_MAKEFLAGS)" \
-t localai-tests .
@@ -235,20 +285,12 @@ teardown-e2e:
## Integration and unit tests
########################################################
test-llama-gguf: prepare-test
TEST_DIR=$(abspath ./)/test-dir/ FIXTURES=$(abspath ./)/tests/fixtures CONFIG_FILE=$(abspath ./)/test-models/config.yaml MODELS_PATH=$(abspath ./)/test-models BACKENDS_PATH=$(abspath ./)/backends \
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="llama-gguf" --flake-attempts $(TEST_FLAKES) -v -r $(TEST_PATHS)
test-tts: prepare-test
TEST_DIR=$(abspath ./)/test-dir/ FIXTURES=$(abspath ./)/tests/fixtures CONFIG_FILE=$(abspath ./)/test-models/config.yaml MODELS_PATH=$(abspath ./)/test-models BACKENDS_PATH=$(abspath ./)/backends \
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="tts" --flake-attempts $(TEST_FLAKES) -v -r $(TEST_PATHS)
test-stablediffusion: prepare-test
TEST_DIR=$(abspath ./)/test-dir/ FIXTURES=$(abspath ./)/tests/fixtures CONFIG_FILE=$(abspath ./)/test-models/config.yaml MODELS_PATH=$(abspath ./)/test-models BACKENDS_PATH=$(abspath ./)/backends \
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="stablediffusion" --flake-attempts $(TEST_FLAKES) -v -r $(TEST_PATHS)
test-stores:
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="stores" --flake-attempts $(TEST_FLAKES) -v -r tests/integration
## Storage / vector-store integration. Requires the local-store backend to
## be available — we build it on demand and pass its location via
## BACKENDS_PATH (the model loader looks there for the gRPC binary).
test-stores: backends/local-store
BACKENDS_PATH=$(abspath ./)/backends \
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) -v -r tests/integration
test-opus:
@echo 'Running opus backend tests'
@@ -260,6 +302,8 @@ test-opus-docker:
docker build --target builder \
--build-arg BUILD_TYPE=$(or $(BUILD_TYPE),) \
--build-arg BASE_IMAGE=$(or $(BASE_IMAGE),ubuntu:24.04) \
--build-arg APT_MIRROR=$(APT_MIRROR) \
--build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
--build-arg BACKEND=opus \
-t localai-opus-test -f backend/Dockerfile.golang .
docker run --rm localai-opus-test \
@@ -269,23 +313,13 @@ test-realtime: build-mock-backend
@echo 'Running realtime e2e tests (mock backend)'
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="Realtime && !real-models" --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e
# Real-model realtime tests. Set REALTIME_TEST_MODEL to use your own pipeline,
# or leave unset to auto-build one from the component env vars below.
# Container-based real-model realtime testing. Build env vars / pipeline
# definition kept here so test-realtime-models-docker can drive a fully wired
# pipeline (VAD + STT + LLM + TTS) from inside a containerised runner.
REALTIME_VAD?=silero-vad-ggml
REALTIME_STT?=whisper-1
REALTIME_LLM?=qwen3-0.6b
REALTIME_TTS?=tts-1
REALTIME_BACKENDS_PATH?=$(abspath ./)/backends
test-realtime-models: build-mock-backend
@echo 'Running realtime e2e tests (real models)'
REALTIME_TEST_MODEL=$${REALTIME_TEST_MODEL:-realtime-test-pipeline} \
REALTIME_VAD=$(REALTIME_VAD) \
REALTIME_STT=$(REALTIME_STT) \
REALTIME_LLM=$(REALTIME_LLM) \
REALTIME_TTS=$(REALTIME_TTS) \
REALTIME_BACKENDS_PATH=$(REALTIME_BACKENDS_PATH) \
$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter="Realtime" --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e
# --- Container-based real-model testing ---
@@ -299,7 +333,7 @@ local-backends:
extract-backend-%: docker-build-% local-backends
@echo "Extracting backend $*..."
@CID=$$(docker create local-ai-backend:$*) && \
@CID=$$(docker create --entrypoint=/run.sh local-ai-backend:$*) && \
rm -rf local-backends/$* && mkdir -p local-backends/$* && \
docker cp $$CID:/ - | tar -xf - -C local-backends/$* && \
docker rm $$CID > /dev/null
@@ -311,6 +345,8 @@ test-realtime-models-docker: build-mock-backend
--build-arg BUILD_TYPE=$(or $(BUILD_TYPE),cublas) \
--build-arg CUDA_MAJOR_VERSION=$(or $(CUDA_MAJOR_VERSION),13) \
--build-arg CUDA_MINOR_VERSION=$(or $(CUDA_MINOR_VERSION),0) \
--build-arg APT_MIRROR=$(APT_MIRROR) \
--build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
-t localai-test-runner .
docker run --rm \
$(REALTIME_DOCKER_FLAGS) \
@@ -394,7 +430,13 @@ protoc:
.PHONY: protogen-go
protogen-go: protoc install-go-tools
mkdir -p pkg/grpc/proto
./protoc --experimental_allow_proto3_optional -Ibackend/ --go_out=pkg/grpc/proto/ --go_opt=paths=source_relative --go-grpc_out=pkg/grpc/proto/ --go-grpc_opt=paths=source_relative \
# install-go-tools writes protoc-gen-go and protoc-gen-go-grpc into
# $(shell go env GOPATH)/bin, which isn't on every dev's PATH. protoc
# resolves its code-gen plugins via PATH, so without this prefix the
# generate step fails with "protoc-gen-go: program not found". Prepend
# GOPATH/bin so the freshly-installed plugins win without requiring a
# shell-profile change.
PATH="$$(go env GOPATH)/bin:$$PATH" ./protoc --experimental_allow_proto3_optional -Ibackend/ --go_out=pkg/grpc/proto/ --go_opt=paths=source_relative --go-grpc_out=pkg/grpc/proto/ --go-grpc_opt=paths=source_relative \
backend/backend.proto
core/config/inference_defaults.json: ## Fetch inference defaults from unsloth (only if missing)
@@ -421,6 +463,7 @@ prepare-test-extra: protogen-python
$(MAKE) -C backend/python/vllm-omni
$(MAKE) -C backend/python/sglang
$(MAKE) -C backend/python/vibevoice
$(MAKE) -C backend/python/liquid-audio
$(MAKE) -C backend/python/moonshine
$(MAKE) -C backend/python/pocket-tts
$(MAKE) -C backend/python/qwen-tts
@@ -434,6 +477,8 @@ prepare-test-extra: protogen-python
$(MAKE) -C backend/python/ace-step
$(MAKE) -C backend/python/trl
$(MAKE) -C backend/python/tinygrad
$(MAKE) -C backend/python/insightface
$(MAKE) -C backend/python/speaker-recognition
$(MAKE) -C backend/rust/kokoros kokoros-grpc
test-extra: prepare-test-extra
@@ -444,6 +489,7 @@ test-extra: prepare-test-extra
$(MAKE) -C backend/python/vllm test
$(MAKE) -C backend/python/vllm-omni test
$(MAKE) -C backend/python/vibevoice test
$(MAKE) -C backend/python/liquid-audio test
$(MAKE) -C backend/python/moonshine test
$(MAKE) -C backend/python/pocket-tts test
$(MAKE) -C backend/python/qwen-tts test
@@ -457,6 +503,8 @@ test-extra: prepare-test-extra
$(MAKE) -C backend/python/ace-step test
$(MAKE) -C backend/python/trl test
$(MAKE) -C backend/python/tinygrad test
$(MAKE) -C backend/python/insightface test
$(MAKE) -C backend/python/speaker-recognition test
$(MAKE) -C backend/rust/kokoros test
##
@@ -507,11 +555,20 @@ test-extra-backend: protogen-go
BACKEND_TEST_TOOL_NAME="$$BACKEND_TEST_TOOL_NAME" \
BACKEND_TEST_CACHE_TYPE_K="$$BACKEND_TEST_CACHE_TYPE_K" \
BACKEND_TEST_CACHE_TYPE_V="$$BACKEND_TEST_CACHE_TYPE_V" \
BACKEND_TEST_FACE_IMAGE_1_URL="$$BACKEND_TEST_FACE_IMAGE_1_URL" \
BACKEND_TEST_FACE_IMAGE_1_FILE="$$BACKEND_TEST_FACE_IMAGE_1_FILE" \
BACKEND_TEST_FACE_IMAGE_2_URL="$$BACKEND_TEST_FACE_IMAGE_2_URL" \
BACKEND_TEST_FACE_IMAGE_2_FILE="$$BACKEND_TEST_FACE_IMAGE_2_FILE" \
BACKEND_TEST_FACE_IMAGE_3_URL="$$BACKEND_TEST_FACE_IMAGE_3_URL" \
BACKEND_TEST_FACE_IMAGE_3_FILE="$$BACKEND_TEST_FACE_IMAGE_3_FILE" \
BACKEND_TEST_VERIFY_DISTANCE_CEILING="$$BACKEND_TEST_VERIFY_DISTANCE_CEILING" \
go test -v -timeout 30m ./tests/e2e-backends/...
## Convenience wrappers: build the image, then exercise it.
test-extra-backend-llama-cpp: docker-build-llama-cpp
BACKEND_IMAGE=local-ai-backend:llama-cpp $(MAKE) test-extra-backend
BACKEND_IMAGE=local-ai-backend:llama-cpp \
BACKEND_TEST_CAPS=health,load,predict,stream,logprobs,logit_bias \
$(MAKE) test-extra-backend
test-extra-backend-ik-llama-cpp: docker-build-ik-llama-cpp
BACKEND_IMAGE=local-ai-backend:ik-llama-cpp $(MAKE) test-extra-backend
@@ -539,6 +596,7 @@ test-extra-backend-llama-cpp-transcription: docker-build-llama-cpp
BACKEND_TEST_MMPROJ_URL=https://huggingface.co/ggml-org/Qwen3-ASR-0.6B-GGUF/resolve/main/mmproj-Qwen3-ASR-0.6B-Q8_0.gguf \
BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
BACKEND_TEST_CAPS=health,load,transcription \
BACKEND_TEST_CTX_SIZE=2048 \
$(MAKE) test-extra-backend
## vllm is resolved from a HuggingFace model id (no file download) and
@@ -553,6 +611,14 @@ test-extra-backend-vllm: docker-build-vllm
BACKEND_TEST_OPTIONS=tool_parser:hermes \
$(MAKE) test-extra-backend
## vllm multi-node data-parallel smoke test. Runs LocalAI head + a
## `local-ai p2p-worker vllm` follower in docker compose against
## Qwen2.5-0.5B with data_parallel_size=2. Requires 2 NVIDIA GPUs and
## nvidia-container-runtime on the host — vLLM v1's DP coordinator is
## not viable on CPU so this cannot run in CI without GPU.
test-extra-backend-vllm-multinode:
./tests/e2e/vllm-multinode/smoke.sh
## tinygrad mirrors the vllm target (same model, same caps, same parser) so
## the two backends are directly comparable. The LLM path covers Predict,
## streaming and native tool-call extraction. Companion targets below cover
@@ -603,6 +669,258 @@ test-extra-backend-tinygrad-all: \
test-extra-backend-tinygrad-sd \
test-extra-backend-tinygrad-whisper
## insightface — face recognition.
##
## Face fixtures default to the sample images shipped in the
## deepinsight/insightface repository (MIT-licensed). For offline/local
## runs override with BACKEND_TEST_FACE_IMAGE_{1,2,3}_FILE pointing at
## local paths.
FACE_IMAGE_1_URL ?= https://github.com/deepinsight/insightface/raw/master/python-package/insightface/data/images/t1.jpg
FACE_IMAGE_2_URL ?= https://github.com/deepinsight/insightface/raw/master/python-package/insightface/data/images/t1.jpg
FACE_IMAGE_3_URL ?= https://github.com/deepinsight/insightface/raw/master/python-package/insightface/data/images/mask_white.jpg
## Known spoof fixture used by the face_antispoof e2e cap. This is
## upstream's own `image_F2.jpg` (Silent-Face repo, via yakhyo mirror)
## — verified to classify as is_real=false with score < 0.05 on the
## MiniFASNetV2 + MiniFASNetV1SE ensemble.
FACE_SPOOF_IMAGE_URL ?= https://github.com/yakhyo/face-anti-spoofing/raw/main/assets/image_F2.jpg
## Host-side cache for the OpenCV Zoo face ONNX files used by the
## opencv e2e target. The backend image no longer bakes model weights —
## gallery installs bring them via `files:` — but the e2e suite drives
## LoadModel over gRPC directly without going through the gallery. We
## pre-download the ONNX files to a stable host path and pass absolute
## paths in BACKEND_TEST_OPTIONS; `make` skips the downloads when the
## SHA-256 already matches.
INSIGHTFACE_OPENCV_DIR := /tmp/localai-insightface-opencv-cache
INSIGHTFACE_OPENCV_YUNET_URL := https://github.com/opencv/opencv_zoo/raw/main/models/face_detection_yunet/face_detection_yunet_2023mar.onnx
INSIGHTFACE_OPENCV_SFACE_URL := https://github.com/opencv/opencv_zoo/raw/main/models/face_recognition_sface/face_recognition_sface_2021dec.onnx
INSIGHTFACE_OPENCV_YUNET_SHA := 8f2383e4dd3cfbb4553ea8718107fc0423210dc964f9f4280604804ed2552fa4
INSIGHTFACE_OPENCV_SFACE_SHA := 0ba9fbfa01b5270c96627c4ef784da859931e02f04419c829e83484087c34e79
## buffalo_sc (insightface) — pack zip + SHA-256 mirrors the gallery
## entry so the e2e target matches exactly what `local-ai models install
## insightface-buffalo-sc` would have fetched. Smallest insightface pack
## (~16MB) — keeps CI fast while still covering the insightface engine
## code path end-to-end.
INSIGHTFACE_BUFFALO_SC_DIR := /tmp/localai-insightface-buffalo-sc-cache
INSIGHTFACE_BUFFALO_SC_URL := https://github.com/deepinsight/insightface/releases/download/v0.7/buffalo_sc.zip
INSIGHTFACE_BUFFALO_SC_SHA := 57d31b56b6ffa911c8a73cfc1707c73cab76efe7f13b675a05223bf42de47c72
## Silent-Face antispoofing (MiniFASNetV2 + MiniFASNetV1SE) — shared
## between the buffalo_sc and opencv e2e targets. Both ONNX files are
## ~1.7MB, Apache 2.0. URLs + SHAs mirror the gallery entries.
INSIGHTFACE_ANTISPOOF_DIR := /tmp/localai-insightface-antispoof-cache
INSIGHTFACE_ANTISPOOF_V2_URL := https://github.com/yakhyo/face-anti-spoofing/releases/download/weights/MiniFASNetV2.onnx
INSIGHTFACE_ANTISPOOF_V2_SHA := b32929adc2d9c34b9486f8c4c7bc97c1b69bc0ea9befefc380e4faae4e463907
INSIGHTFACE_ANTISPOOF_V1SE_URL := https://github.com/yakhyo/face-anti-spoofing/releases/download/weights/MiniFASNetV1SE.onnx
INSIGHTFACE_ANTISPOOF_V1SE_SHA := ebab7f90c7833fbccd46d3a555410e78d969db5438e169b6524be444862b3676
.PHONY: insightface-opencv-models
insightface-opencv-models:
@mkdir -p $(INSIGHTFACE_OPENCV_DIR)
@if [ "$$(sha256sum $(INSIGHTFACE_OPENCV_DIR)/yunet.onnx 2>/dev/null | awk '{print $$1}')" != "$(INSIGHTFACE_OPENCV_YUNET_SHA)" ]; then \
echo "Fetching YuNet..."; \
curl -fsSL -o $(INSIGHTFACE_OPENCV_DIR)/yunet.onnx $(INSIGHTFACE_OPENCV_YUNET_URL); \
echo "$(INSIGHTFACE_OPENCV_YUNET_SHA) $(INSIGHTFACE_OPENCV_DIR)/yunet.onnx" | sha256sum -c; \
fi
@if [ "$$(sha256sum $(INSIGHTFACE_OPENCV_DIR)/sface.onnx 2>/dev/null | awk '{print $$1}')" != "$(INSIGHTFACE_OPENCV_SFACE_SHA)" ]; then \
echo "Fetching SFace..."; \
curl -fsSL -o $(INSIGHTFACE_OPENCV_DIR)/sface.onnx $(INSIGHTFACE_OPENCV_SFACE_URL); \
echo "$(INSIGHTFACE_OPENCV_SFACE_SHA) $(INSIGHTFACE_OPENCV_DIR)/sface.onnx" | sha256sum -c; \
fi
.PHONY: insightface-antispoof-models
insightface-antispoof-models:
@mkdir -p $(INSIGHTFACE_ANTISPOOF_DIR)
@if [ "$$(sha256sum $(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV2.onnx 2>/dev/null | awk '{print $$1}')" != "$(INSIGHTFACE_ANTISPOOF_V2_SHA)" ]; then \
echo "Fetching MiniFASNetV2..."; \
curl -fsSL -o $(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV2.onnx $(INSIGHTFACE_ANTISPOOF_V2_URL); \
echo "$(INSIGHTFACE_ANTISPOOF_V2_SHA) $(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV2.onnx" | sha256sum -c; \
fi
@if [ "$$(sha256sum $(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV1SE.onnx 2>/dev/null | awk '{print $$1}')" != "$(INSIGHTFACE_ANTISPOOF_V1SE_SHA)" ]; then \
echo "Fetching MiniFASNetV1SE..."; \
curl -fsSL -o $(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV1SE.onnx $(INSIGHTFACE_ANTISPOOF_V1SE_URL); \
echo "$(INSIGHTFACE_ANTISPOOF_V1SE_SHA) $(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV1SE.onnx" | sha256sum -c; \
fi
.PHONY: insightface-buffalo-sc-models
insightface-buffalo-sc-models:
@mkdir -p $(INSIGHTFACE_BUFFALO_SC_DIR)
@if [ "$$(sha256sum $(INSIGHTFACE_BUFFALO_SC_DIR)/buffalo_sc.zip 2>/dev/null | awk '{print $$1}')" != "$(INSIGHTFACE_BUFFALO_SC_SHA)" ]; then \
echo "Fetching buffalo_sc..."; \
curl -fsSL -o $(INSIGHTFACE_BUFFALO_SC_DIR)/buffalo_sc.zip $(INSIGHTFACE_BUFFALO_SC_URL); \
echo "$(INSIGHTFACE_BUFFALO_SC_SHA) $(INSIGHTFACE_BUFFALO_SC_DIR)/buffalo_sc.zip" | sha256sum -c; \
rm -f $(INSIGHTFACE_BUFFALO_SC_DIR)/*.onnx; \
fi
@if [ ! -f "$(INSIGHTFACE_BUFFALO_SC_DIR)/det_500m.onnx" ]; then \
echo "Extracting buffalo_sc..."; \
unzip -o -q $(INSIGHTFACE_BUFFALO_SC_DIR)/buffalo_sc.zip -d $(INSIGHTFACE_BUFFALO_SC_DIR); \
fi
## buffalo_sc — smallest insightface pack (SCRFD-500MF detector + MBF
## recognizer, ~16MB). Exercises the insightface engine code path
## (model_zoo-backed inference) without the ~326MB buffalo_l download.
## No age/gender/landmark heads — face_analyze is dropped from caps.
## The pack is pre-fetched on the host and passed as `root:<dir>` since
## the e2e suite drives LoadModel directly without going through
## LocalAI's gallery flow (which is what would normally populate
## ModelPath and in turn the engine's `_model_dir` option).
test-extra-backend-insightface-buffalo-sc: docker-build-insightface insightface-buffalo-sc-models insightface-antispoof-models
BACKEND_IMAGE=local-ai-backend:insightface \
BACKEND_TEST_MODEL_NAME=insightface-buffalo-sc \
BACKEND_TEST_OPTIONS=engine:insightface,model_pack:buffalo_sc,root:$(INSIGHTFACE_BUFFALO_SC_DIR),antispoof_v2_onnx:$(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV2.onnx,antispoof_v1se_onnx:$(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV1SE.onnx \
BACKEND_TEST_CAPS=health,load,face_detect,face_embed,face_verify,face_antispoof \
BACKEND_TEST_FACE_IMAGE_1_URL=$(FACE_IMAGE_1_URL) \
BACKEND_TEST_FACE_IMAGE_2_URL=$(FACE_IMAGE_2_URL) \
BACKEND_TEST_FACE_IMAGE_3_URL=$(FACE_IMAGE_3_URL) \
BACKEND_TEST_FACE_SPOOF_IMAGE_URL=$(FACE_SPOOF_IMAGE_URL) \
BACKEND_TEST_VERIFY_DISTANCE_CEILING=0.55 \
$(MAKE) test-extra-backend
## OpenCV Zoo YuNet + SFace — Apache 2.0, commercial-safe. face_analyze
## cap is dropped (SFace has no demographic head). The ONNX files are
## pre-fetched on the host via the insightface-opencv-models target and
## passed as absolute paths, since the e2e suite drives LoadModel
## directly without going through LocalAI's gallery flow.
test-extra-backend-insightface-opencv: docker-build-insightface insightface-opencv-models insightface-antispoof-models
BACKEND_IMAGE=local-ai-backend:insightface \
BACKEND_TEST_MODEL_NAME=insightface-opencv \
BACKEND_TEST_OPTIONS=engine:onnx_direct,detector_onnx:$(INSIGHTFACE_OPENCV_DIR)/yunet.onnx,recognizer_onnx:$(INSIGHTFACE_OPENCV_DIR)/sface.onnx,antispoof_v2_onnx:$(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV2.onnx,antispoof_v1se_onnx:$(INSIGHTFACE_ANTISPOOF_DIR)/MiniFASNetV1SE.onnx \
BACKEND_TEST_CAPS=health,load,face_detect,face_embed,face_verify,face_antispoof \
BACKEND_TEST_FACE_IMAGE_1_URL=$(FACE_IMAGE_1_URL) \
BACKEND_TEST_FACE_IMAGE_2_URL=$(FACE_IMAGE_2_URL) \
BACKEND_TEST_FACE_IMAGE_3_URL=$(FACE_IMAGE_3_URL) \
BACKEND_TEST_FACE_SPOOF_IMAGE_URL=$(FACE_SPOOF_IMAGE_URL) \
BACKEND_TEST_VERIFY_DISTANCE_CEILING=0.55 \
$(MAKE) test-extra-backend
## Aggregate — runs both face-recognition model configurations so CI
## catches regressions across engines together.
test-extra-backend-insightface-all: \
test-extra-backend-insightface-buffalo-sc \
test-extra-backend-insightface-opencv
## speaker-recognition — voice (speaker) biometrics.
##
## Audio fixtures default to the speechbrain test samples served
## straight from their GitHub repo — public, no auth needed, and they
## ship as 16kHz mono WAV/FLAC which is exactly what the engine wants.
## example{1,2,5} are three different speakers; the suite treats
## example1 as the "same-image twin" probe (verify(clip, clip) must
## return distance≈0) and the other two as cross-speaker ceilings.
## Override with BACKEND_TEST_VOICE_AUDIO_{1,2,3}_FILE for offline runs.
VOICE_AUDIO_1_URL ?= https://github.com/speechbrain/speechbrain/raw/develop/tests/samples/single-mic/example1.wav
VOICE_AUDIO_2_URL ?= https://github.com/speechbrain/speechbrain/raw/develop/tests/samples/single-mic/example2.flac
VOICE_AUDIO_3_URL ?= https://github.com/speechbrain/speechbrain/raw/develop/tests/samples/single-mic/example5.wav
## ECAPA-TDNN via SpeechBrain — default CI configuration. Auto-downloads
## the checkpoint from HuggingFace on first LoadModel (bundled in the
## backend image pip install). 192-d embeddings, cosine-distance based.
## The e2e suite drives LoadModel directly so we don't rely on LocalAI's
## gallery flow here.
test-extra-backend-speaker-recognition-ecapa: docker-build-speaker-recognition
BACKEND_IMAGE=local-ai-backend:speaker-recognition \
BACKEND_TEST_MODEL_NAME=speechbrain/spkrec-ecapa-voxceleb \
BACKEND_TEST_OPTIONS=engine:speechbrain,source:speechbrain/spkrec-ecapa-voxceleb \
BACKEND_TEST_CAPS=health,load,voice_embed,voice_verify \
BACKEND_TEST_VOICE_AUDIO_1_URL=$(VOICE_AUDIO_1_URL) \
BACKEND_TEST_VOICE_AUDIO_2_URL=$(VOICE_AUDIO_2_URL) \
BACKEND_TEST_VOICE_AUDIO_3_URL=$(VOICE_AUDIO_3_URL) \
BACKEND_TEST_VOICE_VERIFY_DISTANCE_CEILING=0.4 \
$(MAKE) test-extra-backend
## Aggregate — today there's only one voice config; the target exists
## so the CI workflow matches the insightface-all naming convention and
## can grow to include WeSpeaker / 3D-Speaker later.
test-extra-backend-speaker-recognition-all: \
test-extra-backend-speaker-recognition-ecapa
## Realtime e2e with sherpa-onnx driving VAD + STT + TTS against a mocked
## LLM. Extracts the sherpa-onnx Docker image rootfs, downloads the three
## gallery-referenced model bundles (silero-vad, omnilingual-asr, vits-ljs),
## writes the corresponding model config YAMLs, and runs the realtime
## websocket spec in tests/e2e with REALTIME_* env vars wiring the sherpa
## slots into the pipeline. The LLM slot stays on the in-repo mock-backend
## registered unconditionally by tests/e2e/e2e_suite_test.go. See
## tests/e2e/run-realtime-sherpa.sh for the full orchestration.
test-extra-e2e-realtime-sherpa: build-mock-backend docker-build-sherpa-onnx protogen-go react-ui
bash tests/e2e/run-realtime-sherpa.sh
## Streaming ASR via the sherpa-onnx online recognizer. Uses the streaming
## zipformer English model (encoder/decoder/joiner int8 + tokens) from the
## sherpa-onnx gallery entry. Drives both AudioTranscription and
## AudioTranscriptionStream via the e2e-backends gRPC harness; streaming
## emits real partial deltas during decode. Each file is renamed on download
## to the shape sherpa-onnx's online loader expects (encoder.int8.onnx etc.).
test-extra-backend-sherpa-onnx-transcription: docker-build-sherpa-onnx
BACKEND_IMAGE=local-ai-backend:sherpa-onnx \
BACKEND_TEST_MODEL_URL='https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/encoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx#encoder.int8.onnx' \
BACKEND_TEST_EXTRA_FILES='https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/decoder-epoch-99-avg-1-chunk-16-left-128.int8.onnx#decoder.int8.onnx|https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/joiner-epoch-99-avg-1-chunk-16-left-128.int8.onnx#joiner.int8.onnx|https://huggingface.co/csukuangfj/sherpa-onnx-streaming-zipformer-en-2023-06-26/resolve/main/tokens.txt' \
BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
BACKEND_TEST_CAPS=health,load,transcription \
BACKEND_TEST_OPTIONS=subtype=online \
$(MAKE) test-extra-backend
## VITS TTS via the sherpa-onnx backend. Pulls the individual files from
## HuggingFace (the vits-ljs release tarball lives on the k2-fsa github
## but is also mirrored as discrete files on HF). Exercises both
## TTS (write-to-file) and TTSStream (PCM chunks + WAV header) via the
## e2e-backends gRPC harness.
test-extra-backend-sherpa-onnx-tts: docker-build-sherpa-onnx
BACKEND_IMAGE=local-ai-backend:sherpa-onnx \
BACKEND_TEST_MODEL_URL='https://huggingface.co/csukuangfj/vits-ljs/resolve/main/vits-ljs.onnx#vits-ljs.onnx' \
BACKEND_TEST_EXTRA_FILES='https://huggingface.co/csukuangfj/vits-ljs/resolve/main/tokens.txt|https://huggingface.co/csukuangfj/vits-ljs/resolve/main/lexicon.txt' \
BACKEND_TEST_CAPS=health,load,tts \
$(MAKE) test-extra-backend
## VibeVoice TTS via the vibevoice-cpp backend. ModelFile is the
## realtime gguf; the supplementary tokenizer + voice prompt land
## alongside it under the harness's models dir and are wired through
## via the standard Options[] convention (tokenizer=, voice=).
test-extra-backend-vibevoice-cpp-tts: docker-build-vibevoice-cpp
BACKEND_IMAGE=local-ai-backend:vibevoice-cpp \
BACKEND_TEST_MODEL_URL='https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/vibevoice-realtime-0.5B-q8_0.gguf#vibevoice-realtime-0.5B-q8_0.gguf' \
BACKEND_TEST_EXTRA_FILES='https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/tokenizer.gguf#tokenizer.gguf|https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/voice-en-Carter_man.gguf#voice-en-Carter_man.gguf' \
BACKEND_TEST_OPTIONS=tokenizer:tokenizer.gguf,voice:voice-en-Carter_man.gguf \
BACKEND_TEST_CAPS=health,load,tts \
$(MAKE) test-extra-backend
## VibeVoice ASR (long-form, with diarization). type=asr tells the
## backend's Load() to slot ModelFile into the asr_model role; the
## tokenizer is supplied via Options[]. Uses the Q4_K quant (~10 GB)
## rather than Q8_0 (~14 GB) so the bundle fits inside ubuntu-latest's
## post-image disk budget.
test-extra-backend-vibevoice-cpp-transcription: docker-build-vibevoice-cpp
BACKEND_IMAGE=local-ai-backend:vibevoice-cpp \
BACKEND_TEST_MODEL_URL='https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/vibevoice-asr-q4_k.gguf#vibevoice-asr-q4_k.gguf' \
BACKEND_TEST_EXTRA_FILES='https://huggingface.co/mudler/vibevoice.cpp-models/resolve/main/tokenizer.gguf#tokenizer.gguf' \
BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
BACKEND_TEST_OPTIONS=type:asr,tokenizer:tokenizer.gguf \
BACKEND_TEST_CAPS=health,load,transcription \
$(MAKE) test-extra-backend
## Audio transcription wrapper for the whisper.cpp backend.
## Drives the AudioTranscription / AudioTranscriptionStream RPCs against
## ggml-base.en (~145 MB) using the JFK 11s clip. The streaming spec
## asserts len(deltas) >= 1 and concat(deltas) == final.Text - whisper-
## specific multi-segment assertions live in backend/go/whisper/gowhisper_test.go.
test-extra-backend-whisper-transcription: docker-build-whisper
BACKEND_IMAGE=local-ai-backend:whisper \
BACKEND_TEST_MODEL_URL=https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin \
BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
BACKEND_TEST_CAPS=health,load,transcription \
$(MAKE) test-extra-backend
## LocalVQE audio transform (joint AEC + noise suppression + dereverb).
## Exercises the audio_transform capability end-to-end: batch transform
## of a real WAV fixture and bidi streaming of synthetic silent frames.
test-extra-backend-localvqe-transform: docker-build-localvqe
BACKEND_IMAGE=local-ai-backend:localvqe \
BACKEND_TEST_MODEL_URL='https://huggingface.co/LocalAI-io/LocalVQE/resolve/main/localvqe-v1-1.3M-f32.gguf#localvqe-v1-1.3M-f32.gguf' \
BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
BACKEND_TEST_CAPS=health,load,audio_transform \
$(MAKE) test-extra-backend
## sglang mirrors the vllm setup: HuggingFace model id, same tiny Qwen,
## tool-call extraction via sglang's native qwen parser. CPU builds use
## sglang's upstream pyproject_cpu.toml recipe (see backend/python/sglang/install.sh).
@@ -645,6 +963,8 @@ docker:
--build-arg CUDA_MINOR_VERSION=$(CUDA_MINOR_VERSION) \
--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
--build-arg APT_MIRROR=$(APT_MIRROR) \
--build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
-t $(DOCKER_IMAGE) .
docker-cuda12:
@@ -658,11 +978,13 @@ docker-cuda12:
--build-arg BUILD_TYPE=$(BUILD_TYPE) \
--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
--build-arg APT_MIRROR=$(APT_MIRROR) \
--build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
-t $(DOCKER_IMAGE)-cuda-12 .
docker-image-intel:
docker build \
--build-arg BASE_IMAGE=intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04 \
--build-arg BASE_IMAGE=intel/oneapi-basekit:2025.3.2-0-devel-ubuntu24.04 \
--build-arg IMAGE_TYPE=$(IMAGE_TYPE) \
--build-arg GO_TAGS="$(GO_TAGS)" \
--build-arg MAKEFLAGS="$(DOCKER_MAKEFLAGS)" \
@@ -671,6 +993,8 @@ docker-image-intel:
--build-arg CUDA_MINOR_VERSION=$(CUDA_MINOR_VERSION) \
--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
--build-arg APT_MIRROR=$(APT_MIRROR) \
--build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
-t $(DOCKER_IMAGE) .
########################################################
@@ -687,6 +1011,10 @@ backends/llama-cpp-darwin: build
bash ./scripts/build/llama-cpp-darwin.sh
./local-ai backends install "ocifile://$(abspath ./backend-images/llama-cpp.tar)"
backends/ds4-darwin: build
bash ./scripts/build/ds4-darwin.sh
./local-ai backends install "ocifile://$(abspath ./backend-images/ds4.tar)"
build-darwin-python-backend: build
bash ./scripts/build/python-darwin.sh
@@ -728,6 +1056,10 @@ BACKEND_IK_LLAMA_CPP = ik-llama-cpp|ik-llama-cpp|.|false|false
# turboquant is a llama.cpp fork with TurboQuant KV-cache quantization.
# Reuses backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile.
BACKEND_TURBOQUANT = turboquant|turboquant|.|false|false
# ds4 is antirez/ds4, a DeepSeek V4 Flash-specific inference engine.
# Single-model; hardware-only validation lives at tests/e2e-backends/
# (BACKEND_BINARY mode); see docs/superpowers/plans/2026-05-11-ds4-backend.md.
BACKEND_DS4 = ds4|ds4|.|false|false
# Golang backends
BACKEND_PIPER = piper|golang|.|false|true
@@ -739,7 +1071,10 @@ BACKEND_WHISPER = whisper|golang|.|false|true
BACKEND_VOXTRAL = voxtral|golang|.|false|true
BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
BACKEND_QWEN3_TTS_CPP = qwen3-tts-cpp|golang|.|false|true
BACKEND_VIBEVOICE_CPP = vibevoice-cpp|golang|.|false|true
BACKEND_LOCALVQE = localvqe|golang|.|false|true
BACKEND_OPUS = opus|golang|.|false|true
BACKEND_SHERPA_ONNX = sherpa-onnx|golang|.|false|true
# Python backends with root context
BACKEND_RERANKERS = rerankers|python|.|false|true
@@ -748,6 +1083,8 @@ BACKEND_OUTETTS = outetts|python|.|false|true
BACKEND_FASTER_WHISPER = faster-whisper|python|.|false|true
BACKEND_COQUI = coqui|python|.|false|true
BACKEND_RFDETR = rfdetr|python|.|false|true
BACKEND_INSIGHTFACE = insightface|python|.|false|true
BACKEND_SPEAKER_RECOGNITION = speaker-recognition|python|.|false|true
BACKEND_KITTEN_TTS = kitten-tts|python|.|false|true
BACKEND_NEUTTS = neutts|python|.|false|true
BACKEND_KOKORO = kokoro|python|.|false|true
@@ -757,6 +1094,7 @@ BACKEND_SGLANG = sglang|python|.|false|true
BACKEND_DIFFUSERS = diffusers|python|.|--progress=plain|true
BACKEND_CHATTERBOX = chatterbox|python|.|false|true
BACKEND_VIBEVOICE = vibevoice|python|.|--progress=plain|true
BACKEND_LIQUID_AUDIO = liquid-audio|python|.|--progress=plain|true
BACKEND_MOONSHINE = moonshine|python|.|false|true
BACKEND_POCKET_TTS = pocket-tts|python|.|false|true
BACKEND_QWEN_TTS = qwen-tts|python|.|false|true
@@ -790,7 +1128,10 @@ define docker-build-backend
--build-arg CUDA_MINOR_VERSION=$(CUDA_MINOR_VERSION) \
--build-arg UBUNTU_VERSION=$(UBUNTU_VERSION) \
--build-arg UBUNTU_CODENAME=$(UBUNTU_CODENAME) \
--build-arg APT_MIRROR=$(APT_MIRROR) \
--build-arg APT_PORTS_MIRROR=$(APT_PORTS_MIRROR) \
$(if $(FROM_SOURCE),--build-arg FROM_SOURCE=$(FROM_SOURCE)) \
$(if $(AMDGPU_TARGETS),--build-arg AMDGPU_TARGETS=$(AMDGPU_TARGETS)) \
$(if $(filter true,$(5)),--build-arg BACKEND=$(1)) \
-t local-ai-backend:$(1) -f backend/Dockerfile.$(2) $(3)
endef
@@ -805,6 +1146,7 @@ endef
$(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
$(eval $(call generate-docker-build-target,$(BACKEND_DS4)))
$(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
$(eval $(call generate-docker-build-target,$(BACKEND_LOCAL_STORE)))
$(eval $(call generate-docker-build-target,$(BACKEND_HUGGINGFACE)))
@@ -819,6 +1161,8 @@ $(eval $(call generate-docker-build-target,$(BACKEND_OUTETTS)))
$(eval $(call generate-docker-build-target,$(BACKEND_FASTER_WHISPER)))
$(eval $(call generate-docker-build-target,$(BACKEND_COQUI)))
$(eval $(call generate-docker-build-target,$(BACKEND_RFDETR)))
$(eval $(call generate-docker-build-target,$(BACKEND_INSIGHTFACE)))
$(eval $(call generate-docker-build-target,$(BACKEND_SPEAKER_RECOGNITION)))
$(eval $(call generate-docker-build-target,$(BACKEND_KITTEN_TTS)))
$(eval $(call generate-docker-build-target,$(BACKEND_NEUTTS)))
$(eval $(call generate-docker-build-target,$(BACKEND_KOKORO)))
@@ -828,6 +1172,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SGLANG)))
$(eval $(call generate-docker-build-target,$(BACKEND_DIFFUSERS)))
$(eval $(call generate-docker-build-target,$(BACKEND_CHATTERBOX)))
$(eval $(call generate-docker-build-target,$(BACKEND_VIBEVOICE)))
$(eval $(call generate-docker-build-target,$(BACKEND_LIQUID_AUDIO)))
$(eval $(call generate-docker-build-target,$(BACKEND_MOONSHINE)))
$(eval $(call generate-docker-build-target,$(BACKEND_POCKET_TTS)))
$(eval $(call generate-docker-build-target,$(BACKEND_QWEN_TTS)))
@@ -840,6 +1185,8 @@ $(eval $(call generate-docker-build-target,$(BACKEND_WHISPERX)))
$(eval $(call generate-docker-build-target,$(BACKEND_ACE_STEP)))
$(eval $(call generate-docker-build-target,$(BACKEND_ACESTEP_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_QWEN3_TTS_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_VIBEVOICE_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_LOCALVQE)))
$(eval $(call generate-docker-build-target,$(BACKEND_MLX)))
$(eval $(call generate-docker-build-target,$(BACKEND_MLX_VLM)))
$(eval $(call generate-docker-build-target,$(BACKEND_MLX_DISTRIBUTED)))
@@ -848,12 +1195,13 @@ $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_QUANTIZATION)))
$(eval $(call generate-docker-build-target,$(BACKEND_TINYGRAD)))
$(eval $(call generate-docker-build-target,$(BACKEND_KOKOROS)))
$(eval $(call generate-docker-build-target,$(BACKEND_SAM3_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
# Pattern rule for docker-save targets
docker-save-%: backend-images
docker save local-ai-backend:$* -o backend-images/$*.tar
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx
########################################################
### Mock Backend for E2E Tests

View File

@@ -38,7 +38,7 @@
- **Built-in AI agents** — autonomous agents with tool use, RAG, MCP, and skills
- **Privacy-first** — your data never leaves your infrastructure
Created and maintained by [Ettore Di Giacinto](https://github.com/mudler).
Created by [Ettore Di Giacinto](https://github.com/mudler) and maintained by the [LocalAI team](#team).
> [:book: Documentation](https://localai.io/) | [:speech_balloon: Discord](https://discord.gg/uJAeKSAGDy) | [💻 Quickstart](https://localai.io/basics/getting_started/) | [🖼️ Models](https://models.localai.io/) | [❓FAQ](https://localai.io/faq/)
@@ -149,6 +149,7 @@ For more details, see the [Getting Started guide](https://localai.io/basics/gett
## Latest News
- **April 2026**: [Voice recognition](https://github.com/mudler/LocalAI/pull/9500), [Face recognition, identification & liveness detection](https://github.com/mudler/LocalAI/pull/9480), [Ollama API compatibility](https://github.com/mudler/LocalAI/pull/9284), [Video generation in stable-diffusion.ggml](https://github.com/mudler/LocalAI/pull/9420), [Backend versioning with auto-upgrade](https://github.com/mudler/LocalAI/pull/9315), [Pin models & load-on-demand toggle](https://github.com/mudler/LocalAI/pull/9309), [Universal model importer](https://github.com/mudler/LocalAI/pull/9466), new backends: [sglang](https://github.com/mudler/LocalAI/pull/9359), [ik-llama-cpp](https://github.com/mudler/LocalAI/pull/9326), [TurboQuant](https://github.com/mudler/LocalAI/pull/9355), [sam.cpp](https://github.com/mudler/LocalAI/pull/9288), [Kokoros](https://github.com/mudler/LocalAI/pull/9212), [qwen3tts.cpp](https://github.com/mudler/LocalAI/pull/9316), [tinygrad multimodal](https://github.com/mudler/LocalAI/pull/9364)
- **March 2026**: [Agent management](https://github.com/mudler/LocalAI/pull/8820), [New React UI](https://github.com/mudler/LocalAI/pull/8772), [WebRTC](https://github.com/mudler/LocalAI/pull/8790), [MLX-distributed via P2P and RDMA](https://github.com/mudler/LocalAI/pull/8801), [MCP Apps, MCP Client-side](https://github.com/mudler/LocalAI/pull/8947)
- **February 2026**: [Realtime API for audio-to-audio with tool calling](https://github.com/mudler/LocalAI/pull/6245), [ACE-Step 1.5 support](https://github.com/mudler/LocalAI/pull/8396)
- **January 2026**: **LocalAI 3.10.0** — Anthropic API support, Open Responses API, video & image generation (LTX-2), unified GPU backends, tool streaming, Moonshine, Pocket-TTS. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v3.10.0)
@@ -200,13 +201,14 @@ See the full [Backend & Model Compatibility Table](https://localai.io/model-comp
- [Media & blog posts](https://localai.io/basics/news/#media-blogs-social)
- [Examples](https://github.com/mudler/LocalAI-examples)
## Autonomous Development Team
## Team
LocalAI is helped being maintained by a team of autonomous AI agents led by an AI Scrum Master.
LocalAI is maintained by a small team of humans, together with the wider community of contributors.
- **Live Reports**: [reports.localai.io](http://reports.localai.io)
- **Project Board**: [Agent task tracking](https://github.com/users/mudler/projects/6)
- **Blog Post**: [Learn about the experiment](https://mudler.pm/posts/2026/02/28/a-call-to-open-source-maintainers-stop-babysitting-ai-how-i-built-a-100-local-autonomous-dev-team-to-maintain-localai-and-why-you-should-too/)
- **[Ettore Di Giacinto](https://github.com/mudler)** — original author and project lead
- **[Richard Palethorpe](https://github.com/richiejp)** — maintainer
A huge thank you to everyone who contributes code, reviews PRs, files issues, and helps users in [Discord](https://discord.gg/uJAeKSAGDy) — LocalAI is a community-driven project and wouldn't exist without you. See the full [contributors list](https://github.com/mudler/LocalAI/graphs/contributors).
## Citation
@@ -249,7 +251,7 @@ A special thanks to individual sponsors, a full list is on [GitHub](https://gith
## License
LocalAI is a community-driven project created by [Ettore Di Giacinto](https://github.com/mudler/).
LocalAI is a community-driven project created by [Ettore Di Giacinto](https://github.com/mudler/) and maintained by the [LocalAI team](#team).
MIT - Author Ettore Di Giacinto <mudler@localai.io>

View File

@@ -0,0 +1,98 @@
# syntax=docker/dockerfile:1.7
#
# Pre-built builder base image for LocalAI's C++ backends.
#
# This Dockerfile is the source of truth for the
# `quay.io/go-skynet/ci-cache:base-grpc-*` images that
# `.github/workflows/base-images.yml` builds and pushes. The output of a
# build is a fully-prepped builder layer containing:
#
# - apt build deps (build-essential, ccache, git, make, pkg-config,
# libcurl4-openssl-dev, libssl-dev, curl, unzip, wget, ca-certificates)
# - cmake (apt or, when CMAKE_FROM_SOURCE=true, compiled from
# ${CMAKE_VERSION})
# - protoc v27.1 at /usr/local/bin/protoc
# - gRPC ${GRPC_VERSION} compiled and installed at /opt/grpc
# - Conditional CUDA toolkit (BUILD_TYPE=cublas|l4t, SKIP_DRIVERS=false)
# including the cuda-13 + arm64 cudss/nvpl special case
# - Conditional ROCm/HIP build deps (BUILD_TYPE=hipblas)
# - Conditional Vulkan SDK 1.4.335.0 (BUILD_TYPE=vulkan)
#
# Variants built by the workflow (matrix in base-images.yml):
#
# base-grpc-amd64 ubuntu:24.04, CPU-only
# base-grpc-arm64 ubuntu:24.04, CPU-only
# base-grpc-cuda-12-amd64 ubuntu:24.04 + CUDA 12.8
# base-grpc-cuda-13-amd64 ubuntu:22.04 + CUDA 13.0
# base-grpc-cuda-13-arm64 ubuntu:24.04 + CUDA 13.0 (sbsa)
# base-grpc-l4t-cuda-12-arm64 ubuntu:22.04 + CUDA 12.x (legacy JetPack)
# base-grpc-rocm-amd64 rocm/dev-ubuntu-24.04:7.2.1 + hipblas
# base-grpc-vulkan-amd64 ubuntu:24.04 + Vulkan SDK 1.4.335
# base-grpc-vulkan-arm64 ubuntu:24.04 + Vulkan SDK ARM 1.4.335
# base-grpc-intel-amd64 intel/oneapi-basekit:2025.3.2 (sycl)
#
# This is a SINGLE-stage Dockerfile by design: the final image IS the
# builder base. The intermediate gRPC compile happens inside this same
# stage so consumer Dockerfiles in PR 2 can simply
# `FROM quay.io/go-skynet/ci-cache:base-grpc-<variant>` without needing a
# COPY --from=grpc step. /opt/grpc is the canonical install prefix and
# downstream builds will add it to CMAKE_PREFIX_PATH (or copy to
# /usr/local) the same way Dockerfile.llama-cpp does today.
#
# Install logic lives in .docker/install-base-deps.sh, which is also
# bind-mounted by the variant Dockerfiles' builder-fromsource stage.
# This guarantees bit-equivalence between the prebuilt CI base and the
# from-source local-dev path — both invoke the same script with the
# same env inputs.
ARG BASE_IMAGE=ubuntu:24.04
FROM ${BASE_IMAGE}
ARG BASE_IMAGE=ubuntu:24.04
ARG BUILD_TYPE=""
ARG CUDA_MAJOR_VERSION=""
ARG CUDA_MINOR_VERSION=""
ARG CMAKE_FROM_SOURCE=false
# CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain
# detection / arch table issues.
ARG CMAKE_VERSION=3.31.10
ARG GRPC_VERSION=v1.65.0
ARG GRPC_MAKEFLAGS="-j4 -Otarget"
ARG SKIP_DRIVERS=false
ARG TARGETARCH
ARG UBUNTU_VERSION=2404
ARG APT_MIRROR=""
ARG APT_PORTS_MIRROR=""
ARG AMDGPU_TARGETS=""
ENV BUILD_TYPE=${BUILD_TYPE} \
CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION} \
CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION} \
CMAKE_FROM_SOURCE=${CMAKE_FROM_SOURCE} \
CMAKE_VERSION=${CMAKE_VERSION} \
GRPC_VERSION=${GRPC_VERSION} \
GRPC_MAKEFLAGS=${GRPC_MAKEFLAGS} \
SKIP_DRIVERS=${SKIP_DRIVERS} \
TARGETARCH=${TARGETARCH} \
UBUNTU_VERSION=${UBUNTU_VERSION} \
APT_MIRROR=${APT_MIRROR} \
APT_PORTS_MIRROR=${APT_PORTS_MIRROR} \
AMDGPU_TARGETS=${AMDGPU_TARGETS} \
MAKEFLAGS=${GRPC_MAKEFLAGS} \
DEBIAN_FRONTEND=noninteractive
# CUDA on PATH (no-op when CUDA isn't installed)
ENV PATH=/usr/local/cuda/bin:${PATH}
# HipBLAS / ROCm on PATH (no-op when ROCm isn't installed)
ENV PATH=/opt/rocm/bin:${PATH}
WORKDIR /build
# Single RUN that delegates to .docker/install-base-deps.sh — the same
# script the variant Dockerfiles' builder-fromsource stage runs.
RUN --mount=type=bind,source=.docker/install-base-deps.sh,target=/usr/local/sbin/install-base-deps \
--mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
bash /usr/local/sbin/install-base-deps
WORKDIR /

41
backend/Dockerfile.ds4 Normal file
View File

@@ -0,0 +1,41 @@
ARG BASE_IMAGE=ubuntu:24.04
ARG APT_MIRROR=""
ARG APT_PORTS_MIRROR=""
# BASE_IMAGE is either ubuntu:24.04 (for cpu builds) or nvidia/cuda:13.0.0-devel-ubuntu24.04
# (for cublas builds). Both ship apt + Ubuntu Noble packages; the nvidia/cuda base
# additionally provides /usr/local/cuda. Darwin (Metal) builds bypass this Dockerfile
# entirely via scripts/build/ds4-darwin.sh.
FROM ${BASE_IMAGE} AS builder
ARG BUILD_TYPE
ARG TARGETARCH
ARG TARGETVARIANT
ENV BUILD_TYPE=${BUILD_TYPE} \
DEBIAN_FRONTEND=noninteractive \
PATH=/usr/local/cuda/bin:${PATH}
WORKDIR /build
# Install build-time deps via plain apt - install-base-deps.sh's full pipeline
# (CUDA keyring + from-source gRPC) is unnecessary here:
# - CUDA: when BASE_IMAGE=nvidia/cuda:*, /usr/local/cuda is already populated;
# for the cpu build we don't need CUDA at all.
# - gRPC/Protobuf: system apt packages are sufficient; ds4's wrapper only links
# against them, it doesn't ship the gRPC source tree.
# - nlohmann-json: dsml_renderer's only third-party dep.
RUN apt-get update && \
apt-get install -y --no-install-recommends \
git cmake build-essential pkg-config ca-certificates \
libgrpc++-dev libprotobuf-dev protobuf-compiler protobuf-compiler-grpc \
nlohmann-json3-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
COPY . /LocalAI
RUN --mount=type=cache,target=/root/.ccache,id=ds4-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
make -C /LocalAI/backend/cpp/ds4 BUILD_TYPE=${BUILD_TYPE} NATIVE=false grpc-server package
FROM scratch
COPY --from=builder /LocalAI/backend/cpp/ds4/package/. ./

View File

@@ -1,4 +1,6 @@
ARG BASE_IMAGE=ubuntu:24.04
ARG APT_MIRROR=""
ARG APT_PORTS_MIRROR=""
FROM ${BASE_IMAGE} AS builder
ARG BACKEND=rerankers
@@ -14,8 +16,20 @@ ARG TARGETARCH
ARG TARGETVARIANT
ARG GO_VERSION=1.25.4
ARG UBUNTU_VERSION=2404
ARG AMDGPU_TARGETS
ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}
ARG APT_MIRROR
ARG APT_PORTS_MIRROR
RUN apt-get update && \
# gcc-14 is the default on noble (ubuntu:24.04) but absent from jammy
# (the L4T jetpack r36.4.0 base). LocalVQE specifically needs it; the
# other Go backends compile fine with the default gcc shipped via
# build-essential. So: try gcc-14 from the configured repos, fall back
# gracefully when it's not available so jammy-based builds don't fail
# at the apt step.
RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
git ccache \
@@ -23,6 +37,12 @@ RUN apt-get update && \
make cmake wget libopenblas-dev \
curl unzip \
libssl-dev && \
if apt-cache show gcc-14 >/dev/null 2>&1 && apt-cache show g++-14 >/dev/null 2>&1; then \
apt-get install -y --no-install-recommends gcc-14 g++-14 && \
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-14 100 \
--slave /usr/bin/g++ g++ /usr/bin/g++-14 \
--slave /usr/bin/gcov gcov /usr/bin/gcov-14; \
fi && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
@@ -147,6 +167,7 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
hipblas-dev \
hipblaslt-dev \
rocblas-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \

View File

@@ -1,279 +1,149 @@
ARG BASE_IMAGE=ubuntu:24.04
ARG GRPC_BASE_IMAGE=${BASE_IMAGE}
# BUILDER_BASE_IMAGE defaults to BASE_IMAGE so the Dockerfile parses even
# when no prebuilt base is supplied. The builder-prebuilt stage is only
# entered when BUILDER_TARGET=builder-prebuilt, so a "wrong" fallback
# content here is harmless — BuildKit prunes the unreferenced builder.
ARG BUILDER_BASE_IMAGE=${BASE_IMAGE}
# BUILDER_TARGET selects which builder stage the final scratch image copies
# package output from. Declared at global scope (before any FROM) so it's
# usable in `FROM ${BUILDER_TARGET}` below. Default keeps local
# `make backends/ik-llama-cpp` on the from-source path.
ARG BUILDER_TARGET=builder-fromsource
ARG APT_MIRROR=""
ARG APT_PORTS_MIRROR=""
# The grpc target does one thing, it builds and installs GRPC. This is in it's own layer so that it can be effectively cached by CI.
# You probably don't need to change anything here, and if you do, make sure that CI is adjusted so that the cache continues to work.
FROM ${GRPC_BASE_IMAGE} AS grpc
# This is a bit of a hack, but it's required in order to be able to effectively cache this layer in CI
ARG GRPC_MAKEFLAGS="-j4 -Otarget"
ARG GRPC_VERSION=v1.65.0
# ============================================================================
# Stage: builder-fromsource — self-contained build path.
# Runs .docker/install-base-deps.sh (apt deps + cmake + protoc + gRPC +
# conditional CUDA/ROCm/Vulkan), copies /opt/grpc to /usr/local, then
# compiles the variant. Used when BUILDER_TARGET=builder-fromsource (the
# default; local `make backends/ik-llama-cpp`).
#
# The install script is the same one that backend/Dockerfile.base-grpc-builder
# runs, so the result is bit-equivalent to the prebuilt-base path
# (builder-prebuilt below).
# ============================================================================
FROM ${BASE_IMAGE} AS builder-fromsource
ARG BUILD_TYPE
ARG CUDA_MAJOR_VERSION
ARG CUDA_MINOR_VERSION
ARG CMAKE_FROM_SOURCE=false
# CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues
ARG CMAKE_VERSION=3.31.10
ENV MAKEFLAGS=${GRPC_MAKEFLAGS}
WORKDIR /build
RUN apt-get update && \
apt-get install -y --no-install-recommends \
ca-certificates \
build-essential curl libssl-dev \
git wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Install CMake (the version in 22.04 is too old)
RUN <<EOT bash
if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
else
apt-get update && \
apt-get install -y \
cmake && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
# We install GRPC to a different prefix here so that we can copy in only the build artifacts later
# saves several hundred MB on the final docker image size vs copying in the entire GRPC source tree
# and running make install in the target container
RUN git clone --recurse-submodules --jobs 4 -b ${GRPC_VERSION} --depth 1 --shallow-submodules https://github.com/grpc/grpc && \
mkdir -p /build/grpc/cmake/build && \
cd /build/grpc/cmake/build && \
sed -i "216i\ TESTONLY" "../../third_party/abseil-cpp/absl/container/CMakeLists.txt" && \
cmake -DgRPC_INSTALL=ON -DgRPC_BUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX:PATH=/opt/grpc ../.. && \
make && \
make install && \
rm -rf /build
FROM ${BASE_IMAGE} AS builder
ARG CMAKE_FROM_SOURCE=false
ARG CMAKE_VERSION=3.31.10
# We can target specific CUDA ARCHITECTURES like --build-arg CUDA_DOCKER_ARCH='75;86;89;120'
ARG CUDA_DOCKER_ARCH
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
ARG CMAKE_ARGS
ENV CMAKE_ARGS=${CMAKE_ARGS}
ARG BACKEND=rerankers
ARG BUILD_TYPE
ENV BUILD_TYPE=${BUILD_TYPE}
ARG CUDA_MAJOR_VERSION
ARG CUDA_MINOR_VERSION
ARG GRPC_VERSION=v1.65.0
ARG GRPC_MAKEFLAGS="-j4 -Otarget"
ARG SKIP_DRIVERS=false
ENV CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION}
ENV CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION}
ENV DEBIAN_FRONTEND=noninteractive
ARG TARGETARCH
ARG TARGETVARIANT
ARG GO_VERSION=1.25.4
ARG UBUNTU_VERSION=2404
ARG APT_MIRROR
ARG APT_PORTS_MIRROR
ARG AMDGPU_TARGETS=""
ARG BACKEND=rerankers
# CUDA target archs, e.g. --build-arg CUDA_DOCKER_ARCH='75;86;89;120'
ARG CUDA_DOCKER_ARCH
ARG CMAKE_ARGS
RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
ccache git \
ca-certificates \
make \
pkg-config libcurl4-openssl-dev \
curl unzip \
libssl-dev wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
ENV BUILD_TYPE=${BUILD_TYPE} \
CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION} \
CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION} \
CMAKE_FROM_SOURCE=${CMAKE_FROM_SOURCE} \
CMAKE_VERSION=${CMAKE_VERSION} \
GRPC_VERSION=${GRPC_VERSION} \
GRPC_MAKEFLAGS=${GRPC_MAKEFLAGS} \
SKIP_DRIVERS=${SKIP_DRIVERS} \
TARGETARCH=${TARGETARCH} \
UBUNTU_VERSION=${UBUNTU_VERSION} \
APT_MIRROR=${APT_MIRROR} \
APT_PORTS_MIRROR=${APT_PORTS_MIRROR} \
AMDGPU_TARGETS=${AMDGPU_TARGETS} \
CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH} \
CMAKE_ARGS=${CMAKE_ARGS} \
DEBIAN_FRONTEND=noninteractive
# Cuda
# CUDA on PATH (no-op when CUDA isn't installed)
ENV PATH=/usr/local/cuda/bin:${PATH}
# HipBLAS requirements
# HipBLAS / ROCm on PATH (no-op when ROCm isn't installed)
ENV PATH=/opt/rocm/bin:${PATH}
WORKDIR /build
# Vulkan requirements
RUN <<EOT bash
if [ "${BUILD_TYPE}" = "vulkan" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common pciutils wget gpg-agent && \
apt-get install -y libglm-dev cmake libxcb-dri3-0 libxcb-present0 libpciaccess0 \
libpng-dev libxcb-keysyms1-dev libxcb-dri3-dev libx11-dev g++ gcc \
libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
if [ "amd64" = "$TARGETARCH" ]; then
wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
rm vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
mkdir -p /opt/vulkan-sdk && \
mv 1.4.335.0 /opt/vulkan-sdk/ && \
cd /opt/vulkan-sdk/1.4.335.0 && \
./vulkansdk --no-deps --maxjobs \
vulkan-loader \
vulkan-validationlayers \
vulkan-extensionlayer \
vulkan-tools \
shaderc && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/bin/* /usr/bin/ && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/lib/* /usr/lib/x86_64-linux-gnu/ && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/include/* /usr/include/ && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/share/* /usr/share/ && \
rm -rf /opt/vulkan-sdk
fi
if [ "arm64" = "$TARGETARCH" ]; then
mkdir vulkan && cd vulkan && \
curl -L -o vulkan-sdk.tar.xz https://github.com/mudler/vulkan-sdk-arm/releases/download/1.4.335.0/vulkansdk-ubuntu-24.04-arm-1.4.335.0.tar.xz && \
tar -xvf vulkan-sdk.tar.xz && \
rm vulkan-sdk.tar.xz && \
cd 1.4.335.0 && \
cp -rfv aarch64/bin/* /usr/bin/ && \
cp -rfv aarch64/lib/* /usr/lib/aarch64-linux-gnu/ && \
cp -rfv aarch64/include/* /usr/include/ && \
cp -rfv aarch64/share/* /usr/share/ && \
cd ../.. && \
rm -rf vulkan
fi
ldconfig && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
# CuBLAS requirements
RUN <<EOT bash
if ( [ "${BUILD_TYPE}" = "cublas" ] || [ "${BUILD_TYPE}" = "l4t" ] ) && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common pciutils
if [ "amd64" = "$TARGETARCH" ]; then
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/x86_64/cuda-keyring_1.1-1_all.deb
fi
if [ "arm64" = "$TARGETARCH" ]; then
if [ "${CUDA_MAJOR_VERSION}" = "13" ]; then
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/sbsa/cuda-keyring_1.1-1_all.deb
else
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/arm64/cuda-keyring_1.1-1_all.deb
fi
fi
dpkg -i cuda-keyring_1.1-1_all.deb && \
rm -f cuda-keyring_1.1-1_all.deb && \
apt-get update && \
apt-get install -y --no-install-recommends \
cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcusparse-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcusolver-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
if [ "${CUDA_MAJOR_VERSION}" = "13" ] && [ "arm64" = "$TARGETARCH" ]; then
apt-get install -y --no-install-recommends \
libcufile-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libcudnn9-cuda-${CUDA_MAJOR_VERSION} cuda-cupti-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libnvjitlink-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
fi
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
# https://github.com/NVIDIA/Isaac-GR00T/issues/343
RUN <<EOT bash
if [ "${BUILD_TYPE}" = "cublas" ] && [ "${TARGETARCH}" = "arm64" ]; then
wget https://developer.download.nvidia.com/compute/cudss/0.6.0/local_installers/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
dpkg -i cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
cp /var/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0/cudss-*-keyring.gpg /usr/share/keyrings/ && \
apt-get update && apt-get -y install cudss cudss-cuda-${CUDA_MAJOR_VERSION} && \
wget https://developer.download.nvidia.com/compute/nvpl/25.5/local_installers/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
dpkg -i nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
cp /var/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5/nvpl-*-keyring.gpg /usr/share/keyrings/ && \
apt-get update && apt-get install -y nvpl
fi
EOT
# If we are building with clblas support, we need the libraries for the builds
RUN if [ "${BUILD_TYPE}" = "clblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
apt-get update && \
apt-get install -y --no-install-recommends \
libclblast-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* \
; fi
RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
apt-get update && \
apt-get install -y --no-install-recommends \
hipblas-dev \
rocblas-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
# I have no idea why, but the ROCM lib packages don't trigger ldconfig after they install, which results in local-ai and others not being able
# to locate the libraries. We run ldconfig ourselves to work around this packaging deficiency
ldconfig \
; fi
RUN echo "TARGETARCH: $TARGETARCH"
# We need protoc installed, and the version in 22.04 is too old. We will create one as part installing the GRPC build below
# but that will also being in a newer version of absl which stablediffusion cannot compile with. This version of protoc is only
# here so that we can generate the grpc code for the stablediffusion build
RUN <<EOT bash
if [ "amd64" = "$TARGETARCH" ]; then
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-x86_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
fi
if [ "arm64" = "$TARGETARCH" ]; then
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-aarch_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
fi
EOT
# Install CMake (the version in 22.04 is too old)
RUN <<EOT bash
if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
else
apt-get update && \
apt-get install -y \
cmake && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
COPY --from=grpc /opt/grpc /usr/local
# Install everything via the shared script — the same one that
# backend/Dockerfile.base-grpc-builder runs, so the prebuilt CI base and
# this from-source path are bit-equivalent.
RUN --mount=type=bind,source=.docker/install-base-deps.sh,target=/usr/local/sbin/install-base-deps \
--mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
bash /usr/local/sbin/install-base-deps
# Mirror builder-prebuilt: copy gRPC from /opt/grpc to /usr/local so
# CMake's find_package finds it at the canonical prefix the Makefile expects.
RUN cp -a /opt/grpc/. /usr/local/
COPY . /LocalAI
RUN <<'EOT' bash
set -euxo pipefail
if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
rm -rf /LocalAI/backend/cpp/ik-llama-cpp-*-build
fi
cd /LocalAI/backend/cpp/ik-llama-cpp
if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
# ARM64 / ROCm: build without x86 SIMD
make ik-llama-cpp-fallback
else
# ik_llama.cpp's IQK kernels require at least AVX2
make ik-llama-cpp-avx2
fi
EOT
# BuildKit cache mount for ccache. See Dockerfile.llama-cpp (commit 9228e5b4)
# for the rationale. Distinct mount id so ik-llama-cpp's cache doesn't
# overlap with llama-cpp's — ik_llama.cpp is a different fork with
# different source.
#
# The compile body is shared with builder-prebuilt via .docker/ik-llama-cpp-compile.sh.
RUN --mount=type=bind,source=.docker/ik-llama-cpp-compile.sh,target=/usr/local/sbin/compile.sh \
--mount=type=cache,target=/root/.ccache,id=ik-llama-cpp-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
bash /usr/local/sbin/compile.sh
# Copy libraries using a script to handle architecture differences
RUN make -BC /LocalAI/backend/cpp/ik-llama-cpp package
# ============================================================================
# Stage: builder-prebuilt — uses the pre-built base from
# quay.io/go-skynet/ci-cache:base-grpc-* (built by .github/workflows/base-images.yml).
# That image already has gRPC at /opt/grpc + apt deps + CUDA/ROCm/Vulkan
# pre-installed, so we just copy gRPC to /usr/local and compile. Used when
# BUILDER_TARGET=builder-prebuilt (CI when the matrix entry sets
# builder-base-image).
# ============================================================================
FROM ${BUILDER_BASE_IMAGE} AS builder-prebuilt
ARG BUILD_TYPE
ENV BUILD_TYPE=${BUILD_TYPE}
ARG CUDA_DOCKER_ARCH
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
ARG CMAKE_ARGS
ENV CMAKE_ARGS=${CMAKE_ARGS}
ARG TARGETARCH
ARG TARGETVARIANT
# The base-grpc-* image installs gRPC to /opt/grpc but doesn't copy it to
# /usr/local. Mirror what the from-source path does so the compile step
# can find gRPC at the canonical prefix the Makefile expects.
RUN cp -a /opt/grpc/. /usr/local/
COPY . /LocalAI
RUN --mount=type=bind,source=.docker/ik-llama-cpp-compile.sh,target=/usr/local/sbin/compile.sh \
--mount=type=cache,target=/root/.ccache,id=ik-llama-cpp-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
bash /usr/local/sbin/compile.sh
RUN make -BC /LocalAI/backend/cpp/ik-llama-cpp package
# ============================================================================
# Final stage — copies package output from one of the two builders.
# BUILDER_TARGET selects which one. BuildKit prunes the unreferenced builder.
#
# BuildKit doesn't support variable expansion in `COPY --from=` directly,
# so we resolve the ARG by aliasing the chosen builder to a fixed stage
# name via `FROM ${BUILDER_TARGET} AS builder` and then COPY --from=builder.
# BUILDER_TARGET itself is declared as a global ARG at the top of this
# file (required for use in FROM), so we just re-import it into this
# stage's scope before the FROM directive.
# ============================================================================
FROM ${BUILDER_TARGET} AS builder
FROM scratch

View File

@@ -1,290 +1,155 @@
ARG BASE_IMAGE=ubuntu:24.04
ARG GRPC_BASE_IMAGE=${BASE_IMAGE}
# BUILDER_BASE_IMAGE defaults to BASE_IMAGE so the Dockerfile parses even
# when no prebuilt base is supplied. The builder-prebuilt stage is only
# entered when BUILDER_TARGET=builder-prebuilt, so a "wrong" fallback
# content here is harmless — BuildKit prunes the unreferenced builder.
ARG BUILDER_BASE_IMAGE=${BASE_IMAGE}
# BUILDER_TARGET selects which builder stage the final scratch image copies
# package output from. Declared at global scope (before any FROM) so it's
# usable in `FROM ${BUILDER_TARGET}` below. Default keeps local
# `make backends/llama-cpp` on the from-source path.
ARG BUILDER_TARGET=builder-fromsource
ARG APT_MIRROR=""
ARG APT_PORTS_MIRROR=""
# The grpc target does one thing, it builds and installs GRPC. This is in it's own layer so that it can be effectively cached by CI.
# You probably don't need to change anything here, and if you do, make sure that CI is adjusted so that the cache continues to work.
FROM ${GRPC_BASE_IMAGE} AS grpc
# This is a bit of a hack, but it's required in order to be able to effectively cache this layer in CI
ARG GRPC_MAKEFLAGS="-j4 -Otarget"
ARG GRPC_VERSION=v1.65.0
# ============================================================================
# Stage: builder-fromsource — self-contained build path.
# Runs .docker/install-base-deps.sh (apt deps + cmake + protoc + gRPC +
# conditional CUDA/ROCm/Vulkan), copies /opt/grpc to /usr/local, then
# compiles the variant. Used when BUILDER_TARGET=builder-fromsource (the
# default; local `make backends/llama-cpp`).
#
# The install script is the same one that backend/Dockerfile.base-grpc-builder
# runs, so the result is bit-equivalent to the prebuilt-base path
# (builder-prebuilt below).
# ============================================================================
FROM ${BASE_IMAGE} AS builder-fromsource
ARG BUILD_TYPE
ARG CUDA_MAJOR_VERSION
ARG CUDA_MINOR_VERSION
ARG CMAKE_FROM_SOURCE=false
# CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues
ARG CMAKE_VERSION=3.31.10
ENV MAKEFLAGS=${GRPC_MAKEFLAGS}
WORKDIR /build
RUN apt-get update && \
apt-get install -y --no-install-recommends \
ca-certificates \
build-essential curl libssl-dev \
git wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Install CMake (the version in 22.04 is too old)
RUN <<EOT bash
if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
else
apt-get update && \
apt-get install -y \
cmake && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
# We install GRPC to a different prefix here so that we can copy in only the build artifacts later
# saves several hundred MB on the final docker image size vs copying in the entire GRPC source tree
# and running make install in the target container
RUN git clone --recurse-submodules --jobs 4 -b ${GRPC_VERSION} --depth 1 --shallow-submodules https://github.com/grpc/grpc && \
mkdir -p /build/grpc/cmake/build && \
cd /build/grpc/cmake/build && \
sed -i "216i\ TESTONLY" "../../third_party/abseil-cpp/absl/container/CMakeLists.txt" && \
cmake -DgRPC_INSTALL=ON -DgRPC_BUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX:PATH=/opt/grpc ../.. && \
make && \
make install && \
rm -rf /build
FROM ${BASE_IMAGE} AS builder
ARG CMAKE_FROM_SOURCE=false
ARG CMAKE_VERSION=3.31.10
# We can target specific CUDA ARCHITECTURES like --build-arg CUDA_DOCKER_ARCH='75;86;89;120'
ARG CUDA_DOCKER_ARCH
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
ARG CMAKE_ARGS
ENV CMAKE_ARGS=${CMAKE_ARGS}
ARG AMDGPU_TARGETS
ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}
ARG BACKEND=rerankers
ARG BUILD_TYPE
ENV BUILD_TYPE=${BUILD_TYPE}
ARG CUDA_MAJOR_VERSION
ARG CUDA_MINOR_VERSION
ARG GRPC_VERSION=v1.65.0
ARG GRPC_MAKEFLAGS="-j4 -Otarget"
ARG SKIP_DRIVERS=false
ENV CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION}
ENV CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION}
ENV DEBIAN_FRONTEND=noninteractive
ARG TARGETARCH
ARG TARGETVARIANT
ARG GO_VERSION=1.25.4
ARG UBUNTU_VERSION=2404
ARG APT_MIRROR
ARG APT_PORTS_MIRROR
ARG AMDGPU_TARGETS
# CUDA target archs, e.g. --build-arg CUDA_DOCKER_ARCH='75;86;89;120'
ARG CUDA_DOCKER_ARCH
ARG CMAKE_ARGS
RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
ccache git \
ca-certificates \
make \
pkg-config libcurl4-openssl-dev \
curl unzip \
libssl-dev wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
ENV BUILD_TYPE=${BUILD_TYPE} \
CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION} \
CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION} \
CMAKE_FROM_SOURCE=${CMAKE_FROM_SOURCE} \
CMAKE_VERSION=${CMAKE_VERSION} \
GRPC_VERSION=${GRPC_VERSION} \
GRPC_MAKEFLAGS=${GRPC_MAKEFLAGS} \
SKIP_DRIVERS=${SKIP_DRIVERS} \
TARGETARCH=${TARGETARCH} \
UBUNTU_VERSION=${UBUNTU_VERSION} \
APT_MIRROR=${APT_MIRROR} \
APT_PORTS_MIRROR=${APT_PORTS_MIRROR} \
AMDGPU_TARGETS=${AMDGPU_TARGETS} \
CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH} \
CMAKE_ARGS=${CMAKE_ARGS} \
DEBIAN_FRONTEND=noninteractive
# Cuda
# CUDA on PATH (no-op when CUDA isn't installed)
ENV PATH=/usr/local/cuda/bin:${PATH}
# HipBLAS requirements
# HipBLAS / ROCm on PATH (no-op when ROCm isn't installed)
ENV PATH=/opt/rocm/bin:${PATH}
WORKDIR /build
# Vulkan requirements
RUN <<EOT bash
if [ "${BUILD_TYPE}" = "vulkan" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common pciutils wget gpg-agent && \
apt-get install -y libglm-dev cmake libxcb-dri3-0 libxcb-present0 libpciaccess0 \
libpng-dev libxcb-keysyms1-dev libxcb-dri3-dev libx11-dev g++ gcc \
libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
if [ "amd64" = "$TARGETARCH" ]; then
wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
rm vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
mkdir -p /opt/vulkan-sdk && \
mv 1.4.335.0 /opt/vulkan-sdk/ && \
cd /opt/vulkan-sdk/1.4.335.0 && \
./vulkansdk --no-deps --maxjobs \
vulkan-loader \
vulkan-validationlayers \
vulkan-extensionlayer \
vulkan-tools \
shaderc && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/bin/* /usr/bin/ && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/lib/* /usr/lib/x86_64-linux-gnu/ && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/include/* /usr/include/ && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/share/* /usr/share/ && \
rm -rf /opt/vulkan-sdk
fi
if [ "arm64" = "$TARGETARCH" ]; then
mkdir vulkan && cd vulkan && \
curl -L -o vulkan-sdk.tar.xz https://github.com/mudler/vulkan-sdk-arm/releases/download/1.4.335.0/vulkansdk-ubuntu-24.04-arm-1.4.335.0.tar.xz && \
tar -xvf vulkan-sdk.tar.xz && \
rm vulkan-sdk.tar.xz && \
cd 1.4.335.0 && \
cp -rfv aarch64/bin/* /usr/bin/ && \
cp -rfv aarch64/lib/* /usr/lib/aarch64-linux-gnu/ && \
cp -rfv aarch64/include/* /usr/include/ && \
cp -rfv aarch64/share/* /usr/share/ && \
cd ../.. && \
rm -rf vulkan
fi
ldconfig && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
# CuBLAS requirements
RUN <<EOT bash
if ( [ "${BUILD_TYPE}" = "cublas" ] || [ "${BUILD_TYPE}" = "l4t" ] ) && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common pciutils
if [ "amd64" = "$TARGETARCH" ]; then
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/x86_64/cuda-keyring_1.1-1_all.deb
fi
if [ "arm64" = "$TARGETARCH" ]; then
if [ "${CUDA_MAJOR_VERSION}" = "13" ]; then
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/sbsa/cuda-keyring_1.1-1_all.deb
else
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/arm64/cuda-keyring_1.1-1_all.deb
fi
fi
dpkg -i cuda-keyring_1.1-1_all.deb && \
rm -f cuda-keyring_1.1-1_all.deb && \
apt-get update && \
apt-get install -y --no-install-recommends \
cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcusparse-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcusolver-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
if [ "${CUDA_MAJOR_VERSION}" = "13" ] && [ "arm64" = "$TARGETARCH" ]; then
apt-get install -y --no-install-recommends \
libcufile-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libcudnn9-cuda-${CUDA_MAJOR_VERSION} cuda-cupti-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libnvjitlink-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
fi
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
# https://github.com/NVIDIA/Isaac-GR00T/issues/343
RUN <<EOT bash
if [ "${BUILD_TYPE}" = "cublas" ] && [ "${TARGETARCH}" = "arm64" ]; then
wget https://developer.download.nvidia.com/compute/cudss/0.6.0/local_installers/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
dpkg -i cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
cp /var/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0/cudss-*-keyring.gpg /usr/share/keyrings/ && \
apt-get update && apt-get -y install cudss cudss-cuda-${CUDA_MAJOR_VERSION} && \
wget https://developer.download.nvidia.com/compute/nvpl/25.5/local_installers/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
dpkg -i nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
cp /var/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5/nvpl-*-keyring.gpg /usr/share/keyrings/ && \
apt-get update && apt-get install -y nvpl
fi
EOT
# If we are building with clblas support, we need the libraries for the builds
RUN if [ "${BUILD_TYPE}" = "clblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
apt-get update && \
apt-get install -y --no-install-recommends \
libclblast-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* \
; fi
RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
apt-get update && \
apt-get install -y --no-install-recommends \
hipblas-dev \
rocblas-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
# I have no idea why, but the ROCM lib packages don't trigger ldconfig after they install, which results in local-ai and others not being able
# to locate the libraries. We run ldconfig ourselves to work around this packaging deficiency
ldconfig && \
# Log which GPU architectures have rocBLAS kernel support
echo "rocBLAS library data architectures:" && \
(ls /opt/rocm*/lib/rocblas/library/Kernels* 2>/dev/null || ls /opt/rocm*/lib64/rocblas/library/Kernels* 2>/dev/null) | grep -oP 'gfx[0-9a-z+-]+' | sort -u || \
echo "WARNING: No rocBLAS kernel data found" \
; fi
RUN echo "TARGETARCH: $TARGETARCH"
# We need protoc installed, and the version in 22.04 is too old. We will create one as part installing the GRPC build below
# but that will also being in a newer version of absl which stablediffusion cannot compile with. This version of protoc is only
# here so that we can generate the grpc code for the stablediffusion build
RUN <<EOT bash
if [ "amd64" = "$TARGETARCH" ]; then
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-x86_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
fi
if [ "arm64" = "$TARGETARCH" ]; then
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-aarch_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
fi
EOT
# Install CMake (the version in 22.04 is too old)
RUN <<EOT bash
if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
else
apt-get update && \
apt-get install -y \
cmake && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
COPY --from=grpc /opt/grpc /usr/local
# Install everything via the shared script — the same one that
# backend/Dockerfile.base-grpc-builder runs, so the prebuilt CI base and
# this from-source path are bit-equivalent.
RUN --mount=type=bind,source=.docker/install-base-deps.sh,target=/usr/local/sbin/install-base-deps \
--mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
bash /usr/local/sbin/install-base-deps
# Mirror builder-prebuilt: copy gRPC from /opt/grpc to /usr/local so
# CMake's find_package finds it at the canonical prefix the Makefile expects.
RUN cp -a /opt/grpc/. /usr/local/
COPY . /LocalAI
RUN <<'EOT' bash
set -euxo pipefail
if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
rm -rf /LocalAI/backend/cpp/llama-cpp-*-build
fi
if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
cd /LocalAI/backend/cpp/llama-cpp
make llama-cpp-fallback
make llama-cpp-grpc
make llama-cpp-rpc-server
else
cd /LocalAI/backend/cpp/llama-cpp
make llama-cpp-avx
make llama-cpp-avx2
make llama-cpp-avx512
make llama-cpp-fallback
make llama-cpp-grpc
make llama-cpp-rpc-server
fi
EOT
# BuildKit cache mount for ccache. Persists compiler outputs across builds
# via the registry cache (cache-to: type=registry,mode=max in CI). On a
# LLAMA_VERSION bump most TUs are byte-identical to the previous version's
# preprocessed source — ccache returns the previous .o file and skips the
# real compile. Same for LocalAI source changes that don't touch llama.cpp.
# CMAKE_*_COMPILER_LAUNCHER threads ccache through CMake to wrap gcc/g++/nvcc.
# sharing=locked serializes concurrent writes if multiple matrix variants
# share the same cache mount id.
#
# The compile body is shared with builder-prebuilt via .docker/llama-cpp-compile.sh.
RUN --mount=type=bind,source=.docker/llama-cpp-compile.sh,target=/usr/local/sbin/compile.sh \
--mount=type=cache,target=/root/.ccache,id=llama-cpp-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
bash /usr/local/sbin/compile.sh
# Copy libraries using a script to handle architecture differences
RUN make -BC /LocalAI/backend/cpp/llama-cpp package
# ============================================================================
# Stage: builder-prebuilt — uses the pre-built base from
# quay.io/go-skynet/ci-cache:base-grpc-* (built by .github/workflows/base-images.yml).
# That image already has gRPC at /opt/grpc + apt deps + CUDA/ROCm/Vulkan
# pre-installed, so we just copy gRPC to /usr/local and compile. Used when
# BUILDER_TARGET=builder-prebuilt (CI when the matrix entry sets
# builder-base-image).
# ============================================================================
FROM ${BUILDER_BASE_IMAGE} AS builder-prebuilt
ARG BUILD_TYPE
ENV BUILD_TYPE=${BUILD_TYPE}
ARG CUDA_DOCKER_ARCH
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
ARG CMAKE_ARGS
ENV CMAKE_ARGS=${CMAKE_ARGS}
ARG AMDGPU_TARGETS
ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}
ARG TARGETARCH
ARG TARGETVARIANT
# The base-grpc-* image installs gRPC to /opt/grpc but doesn't copy it to
# /usr/local. The variant Dockerfile's from-source path does that too;
# mirror it here so the compile step can find gRPC at the canonical
# prefix the Makefile expects.
RUN cp -a /opt/grpc/. /usr/local/
COPY . /LocalAI
RUN --mount=type=bind,source=.docker/llama-cpp-compile.sh,target=/usr/local/sbin/compile.sh \
--mount=type=cache,target=/root/.ccache,id=llama-cpp-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
bash /usr/local/sbin/compile.sh
RUN make -BC /LocalAI/backend/cpp/llama-cpp package
# ============================================================================
# Final stage — copies package output from one of the two builders.
# BUILDER_TARGET selects which one. BuildKit prunes the unreferenced builder.
#
# BuildKit doesn't support variable expansion in `COPY --from=` directly,
# so we resolve the ARG by aliasing the chosen builder to a fixed stage
# name via `FROM ${BUILDER_TARGET} AS builder` and then COPY --from=builder.
# BUILDER_TARGET itself is declared as a global ARG at the top of this
# file (required for use in FROM), so we just re-import it into this
# stage's scope before the FROM directive.
# ============================================================================
FROM ${BUILDER_TARGET} AS builder
FROM scratch

View File

@@ -1,4 +1,6 @@
ARG BASE_IMAGE=ubuntu:24.04
ARG APT_MIRROR=""
ARG APT_PORTS_MIRROR=""
FROM ${BASE_IMAGE} AS builder
ARG BACKEND=rerankers
@@ -13,8 +15,12 @@ ENV DEBIAN_FRONTEND=noninteractive
ARG TARGETARCH
ARG TARGETVARIANT
ARG UBUNTU_VERSION=2404
ARG APT_MIRROR
ARG APT_PORTS_MIRROR
RUN apt-get update && \
RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
ccache \
@@ -162,6 +168,7 @@ RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
hipblas-dev \
hipblaslt-dev \
rocblas-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
@@ -202,6 +209,13 @@ COPY scripts/build/package-gpu-libs.sh /package-gpu-libs.sh
ARG FROM_SOURCE=""
ENV FROM_SOURCE=${FROM_SOURCE}
# Cache-buster for the per-backend `make` step. Most Python backends list
# unpinned deps (torch, transformers, vllm, ...), so a warm registry cache
# would otherwise freeze upstream versions indefinitely. CI passes a value
# that rolls weekly so the install layer is rebuilt at most once per week
# and picks up newer wheels from PyPI / nightly indexes.
ARG DEPS_REFRESH=initial
RUN cd /${BACKEND} && PORTABLE_PYTHON=true make
# Package GPU libraries into the backend's lib directory
@@ -216,4 +230,4 @@ RUN if [ -f "/${BACKEND}/package.sh" ]; then \
FROM scratch
ARG BACKEND=rerankers
COPY --from=builder /${BACKEND}/ /
COPY --from=builder /${BACKEND}/ /

View File

@@ -1,12 +1,18 @@
ARG BASE_IMAGE=ubuntu:24.04
ARG APT_MIRROR=""
ARG APT_PORTS_MIRROR=""
FROM ${BASE_IMAGE} AS builder
ARG BACKEND=kokoros
ENV DEBIAN_FRONTEND=noninteractive
ARG TARGETARCH
ARG TARGETVARIANT
ARG APT_MIRROR
ARG APT_PORTS_MIRROR
RUN apt-get update && \
RUN --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
APT_MIRROR="${APT_MIRROR}" APT_PORTS_MIRROR="${APT_PORTS_MIRROR}" sh /usr/local/sbin/apt-mirror && \
apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
git ccache \

View File

@@ -1,288 +1,158 @@
ARG BASE_IMAGE=ubuntu:24.04
ARG GRPC_BASE_IMAGE=${BASE_IMAGE}
# BUILDER_BASE_IMAGE defaults to BASE_IMAGE so the Dockerfile parses even
# when no prebuilt base is supplied. The builder-prebuilt stage is only
# entered when BUILDER_TARGET=builder-prebuilt, so a "wrong" fallback
# content here is harmless — BuildKit prunes the unreferenced builder.
ARG BUILDER_BASE_IMAGE=${BASE_IMAGE}
# BUILDER_TARGET selects which builder stage the final scratch image copies
# package output from. Declared at global scope (before any FROM) so it's
# usable in `FROM ${BUILDER_TARGET}` below. Default keeps local
# `make backends/turboquant` on the from-source path.
ARG BUILDER_TARGET=builder-fromsource
ARG APT_MIRROR=""
ARG APT_PORTS_MIRROR=""
# The grpc target does one thing, it builds and installs GRPC. This is in it's own layer so that it can be effectively cached by CI.
# You probably don't need to change anything here, and if you do, make sure that CI is adjusted so that the cache continues to work.
FROM ${GRPC_BASE_IMAGE} AS grpc
# This is a bit of a hack, but it's required in order to be able to effectively cache this layer in CI
ARG GRPC_MAKEFLAGS="-j4 -Otarget"
ARG GRPC_VERSION=v1.65.0
# ============================================================================
# Stage: builder-fromsource — self-contained build path.
# Runs .docker/install-base-deps.sh (apt deps + cmake + protoc + gRPC +
# conditional CUDA/ROCm/Vulkan), copies /opt/grpc to /usr/local, then
# compiles the variant. Used when BUILDER_TARGET=builder-fromsource (the
# default; local `make backends/turboquant`).
#
# The install script is the same one that backend/Dockerfile.base-grpc-builder
# runs, so the result is bit-equivalent to the prebuilt-base path
# (builder-prebuilt below).
# ============================================================================
FROM ${BASE_IMAGE} AS builder-fromsource
ARG BUILD_TYPE
ARG CUDA_MAJOR_VERSION
ARG CUDA_MINOR_VERSION
ARG CMAKE_FROM_SOURCE=false
# CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues
ARG CMAKE_VERSION=3.31.10
ENV MAKEFLAGS=${GRPC_MAKEFLAGS}
WORKDIR /build
RUN apt-get update && \
apt-get install -y --no-install-recommends \
ca-certificates \
build-essential curl libssl-dev \
git wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Install CMake (the version in 22.04 is too old)
RUN <<EOT bash
if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
else
apt-get update && \
apt-get install -y \
cmake && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
# We install GRPC to a different prefix here so that we can copy in only the build artifacts later
# saves several hundred MB on the final docker image size vs copying in the entire GRPC source tree
# and running make install in the target container
RUN git clone --recurse-submodules --jobs 4 -b ${GRPC_VERSION} --depth 1 --shallow-submodules https://github.com/grpc/grpc && \
mkdir -p /build/grpc/cmake/build && \
cd /build/grpc/cmake/build && \
sed -i "216i\ TESTONLY" "../../third_party/abseil-cpp/absl/container/CMakeLists.txt" && \
cmake -DgRPC_INSTALL=ON -DgRPC_BUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX:PATH=/opt/grpc ../.. && \
make && \
make install && \
rm -rf /build
FROM ${BASE_IMAGE} AS builder
ARG CMAKE_FROM_SOURCE=false
ARG CMAKE_VERSION=3.31.10
# We can target specific CUDA ARCHITECTURES like --build-arg CUDA_DOCKER_ARCH='75;86;89;120'
ARG CUDA_DOCKER_ARCH
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
ARG CMAKE_ARGS
ENV CMAKE_ARGS=${CMAKE_ARGS}
ARG BACKEND=rerankers
ARG BUILD_TYPE
ENV BUILD_TYPE=${BUILD_TYPE}
ARG CUDA_MAJOR_VERSION
ARG CUDA_MINOR_VERSION
ARG GRPC_VERSION=v1.65.0
ARG GRPC_MAKEFLAGS="-j4 -Otarget"
ARG SKIP_DRIVERS=false
ENV CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION}
ENV CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION}
ENV DEBIAN_FRONTEND=noninteractive
ARG TARGETARCH
ARG TARGETVARIANT
ARG GO_VERSION=1.25.4
ARG UBUNTU_VERSION=2404
ARG APT_MIRROR
ARG APT_PORTS_MIRROR
ARG AMDGPU_TARGETS=""
ARG BACKEND=rerankers
# CUDA target archs, e.g. --build-arg CUDA_DOCKER_ARCH='75;86;89;120'
ARG CUDA_DOCKER_ARCH
ARG CMAKE_ARGS
RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
ccache git \
ca-certificates \
make \
pkg-config libcurl4-openssl-dev \
curl unzip \
libssl-dev wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
ENV BUILD_TYPE=${BUILD_TYPE} \
CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION} \
CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION} \
CMAKE_FROM_SOURCE=${CMAKE_FROM_SOURCE} \
CMAKE_VERSION=${CMAKE_VERSION} \
GRPC_VERSION=${GRPC_VERSION} \
GRPC_MAKEFLAGS=${GRPC_MAKEFLAGS} \
SKIP_DRIVERS=${SKIP_DRIVERS} \
TARGETARCH=${TARGETARCH} \
UBUNTU_VERSION=${UBUNTU_VERSION} \
APT_MIRROR=${APT_MIRROR} \
APT_PORTS_MIRROR=${APT_PORTS_MIRROR} \
AMDGPU_TARGETS=${AMDGPU_TARGETS} \
CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH} \
CMAKE_ARGS=${CMAKE_ARGS} \
DEBIAN_FRONTEND=noninteractive
# Cuda
# CUDA on PATH (no-op when CUDA isn't installed)
ENV PATH=/usr/local/cuda/bin:${PATH}
# HipBLAS requirements
# HipBLAS / ROCm on PATH (no-op when ROCm isn't installed)
ENV PATH=/opt/rocm/bin:${PATH}
WORKDIR /build
# Vulkan requirements
RUN <<EOT bash
if [ "${BUILD_TYPE}" = "vulkan" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common pciutils wget gpg-agent && \
apt-get install -y libglm-dev cmake libxcb-dri3-0 libxcb-present0 libpciaccess0 \
libpng-dev libxcb-keysyms1-dev libxcb-dri3-dev libx11-dev g++ gcc \
libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
if [ "amd64" = "$TARGETARCH" ]; then
wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
rm vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
mkdir -p /opt/vulkan-sdk && \
mv 1.4.335.0 /opt/vulkan-sdk/ && \
cd /opt/vulkan-sdk/1.4.335.0 && \
./vulkansdk --no-deps --maxjobs \
vulkan-loader \
vulkan-validationlayers \
vulkan-extensionlayer \
vulkan-tools \
shaderc && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/bin/* /usr/bin/ && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/lib/* /usr/lib/x86_64-linux-gnu/ && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/include/* /usr/include/ && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/share/* /usr/share/ && \
rm -rf /opt/vulkan-sdk
fi
if [ "arm64" = "$TARGETARCH" ]; then
mkdir vulkan && cd vulkan && \
curl -L -o vulkan-sdk.tar.xz https://github.com/mudler/vulkan-sdk-arm/releases/download/1.4.335.0/vulkansdk-ubuntu-24.04-arm-1.4.335.0.tar.xz && \
tar -xvf vulkan-sdk.tar.xz && \
rm vulkan-sdk.tar.xz && \
cd 1.4.335.0 && \
cp -rfv aarch64/bin/* /usr/bin/ && \
cp -rfv aarch64/lib/* /usr/lib/aarch64-linux-gnu/ && \
cp -rfv aarch64/include/* /usr/include/ && \
cp -rfv aarch64/share/* /usr/share/ && \
cd ../.. && \
rm -rf vulkan
fi
ldconfig && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
# CuBLAS requirements
RUN <<EOT bash
if ( [ "${BUILD_TYPE}" = "cublas" ] || [ "${BUILD_TYPE}" = "l4t" ] ) && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common pciutils
if [ "amd64" = "$TARGETARCH" ]; then
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/x86_64/cuda-keyring_1.1-1_all.deb
fi
if [ "arm64" = "$TARGETARCH" ]; then
if [ "${CUDA_MAJOR_VERSION}" = "13" ]; then
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/sbsa/cuda-keyring_1.1-1_all.deb
else
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/arm64/cuda-keyring_1.1-1_all.deb
fi
fi
dpkg -i cuda-keyring_1.1-1_all.deb && \
rm -f cuda-keyring_1.1-1_all.deb && \
apt-get update && \
apt-get install -y --no-install-recommends \
cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcusparse-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcusolver-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
if [ "${CUDA_MAJOR_VERSION}" = "13" ] && [ "arm64" = "$TARGETARCH" ]; then
apt-get install -y --no-install-recommends \
libcufile-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libcudnn9-cuda-${CUDA_MAJOR_VERSION} cuda-cupti-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libnvjitlink-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
fi
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
# https://github.com/NVIDIA/Isaac-GR00T/issues/343
RUN <<EOT bash
if [ "${BUILD_TYPE}" = "cublas" ] && [ "${TARGETARCH}" = "arm64" ]; then
wget https://developer.download.nvidia.com/compute/cudss/0.6.0/local_installers/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
dpkg -i cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
cp /var/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0/cudss-*-keyring.gpg /usr/share/keyrings/ && \
apt-get update && apt-get -y install cudss cudss-cuda-${CUDA_MAJOR_VERSION} && \
wget https://developer.download.nvidia.com/compute/nvpl/25.5/local_installers/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
dpkg -i nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
cp /var/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5/nvpl-*-keyring.gpg /usr/share/keyrings/ && \
apt-get update && apt-get install -y nvpl
fi
EOT
# If we are building with clblas support, we need the libraries for the builds
RUN if [ "${BUILD_TYPE}" = "clblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
apt-get update && \
apt-get install -y --no-install-recommends \
libclblast-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* \
; fi
RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
apt-get update && \
apt-get install -y --no-install-recommends \
hipblas-dev \
rocblas-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
# I have no idea why, but the ROCM lib packages don't trigger ldconfig after they install, which results in local-ai and others not being able
# to locate the libraries. We run ldconfig ourselves to work around this packaging deficiency
ldconfig && \
# Log which GPU architectures have rocBLAS kernel support
echo "rocBLAS library data architectures:" && \
(ls /opt/rocm*/lib/rocblas/library/Kernels* 2>/dev/null || ls /opt/rocm*/lib64/rocblas/library/Kernels* 2>/dev/null) | grep -oP 'gfx[0-9a-z+-]+' | sort -u || \
echo "WARNING: No rocBLAS kernel data found" \
; fi
RUN echo "TARGETARCH: $TARGETARCH"
# We need protoc installed, and the version in 22.04 is too old. We will create one as part installing the GRPC build below
# but that will also being in a newer version of absl which stablediffusion cannot compile with. This version of protoc is only
# here so that we can generate the grpc code for the stablediffusion build
RUN <<EOT bash
if [ "amd64" = "$TARGETARCH" ]; then
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-x86_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
fi
if [ "arm64" = "$TARGETARCH" ]; then
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-aarch_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
fi
EOT
# Install CMake (the version in 22.04 is too old)
RUN <<EOT bash
if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
else
apt-get update && \
apt-get install -y \
cmake && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
COPY --from=grpc /opt/grpc /usr/local
# Install everything via the shared script — the same one that
# backend/Dockerfile.base-grpc-builder runs, so the prebuilt CI base and
# this from-source path are bit-equivalent.
RUN --mount=type=bind,source=.docker/install-base-deps.sh,target=/usr/local/sbin/install-base-deps \
--mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
bash /usr/local/sbin/install-base-deps
# Mirror builder-prebuilt: copy gRPC from /opt/grpc to /usr/local so
# CMake's find_package finds it at the canonical prefix the Makefile expects.
RUN cp -a /opt/grpc/. /usr/local/
COPY . /LocalAI
RUN <<'EOT' bash
set -euxo pipefail
if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
rm -rf /LocalAI/backend/cpp/turboquant-*-build
fi
cd /LocalAI/backend/cpp/turboquant
if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
make turboquant-fallback
make turboquant-grpc
make turboquant-rpc-server
else
make turboquant-avx
make turboquant-avx2
make turboquant-avx512
make turboquant-fallback
make turboquant-grpc
make turboquant-rpc-server
fi
EOT
# BuildKit cache mount for ccache. See Dockerfile.llama-cpp (commit 9228e5b4)
# for rationale. turboquant is a llama.cpp fork that reuses
# backend/cpp/llama-cpp source via a thin wrapper Makefile, so MOST TUs
# are content-identical to the upstream llama-cpp build. Sharing a cache
# id with llama-cpp could give cross-fork hits — but for now keep them
# separate so a regression in one doesn't poison the other. Revisit
# sharing after measuring the actual hit rate.
#
# The compile body is shared with builder-prebuilt via .docker/turboquant-compile.sh.
RUN --mount=type=bind,source=.docker/turboquant-compile.sh,target=/usr/local/sbin/compile.sh \
--mount=type=cache,target=/root/.ccache,id=turboquant-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
bash /usr/local/sbin/compile.sh
# Copy libraries using a script to handle architecture differences
RUN make -BC /LocalAI/backend/cpp/turboquant package
# ============================================================================
# Stage: builder-prebuilt — uses the pre-built base from
# quay.io/go-skynet/ci-cache:base-grpc-* (built by .github/workflows/base-images.yml).
# That image already has gRPC at /opt/grpc + apt deps + CUDA/ROCm/Vulkan
# pre-installed, so we just copy gRPC to /usr/local and compile. Used when
# BUILDER_TARGET=builder-prebuilt (CI when the matrix entry sets
# builder-base-image).
# ============================================================================
FROM ${BUILDER_BASE_IMAGE} AS builder-prebuilt
ARG BUILD_TYPE
ENV BUILD_TYPE=${BUILD_TYPE}
ARG CUDA_DOCKER_ARCH
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
ARG CMAKE_ARGS
ENV CMAKE_ARGS=${CMAKE_ARGS}
# AMDGPU_TARGETS must be forwarded into the env here too — backend/cpp/llama-cpp/Makefile
# (which the turboquant Makefile reuses via a sibling build dir) errors out when the var
# is empty on a hipblas build, and the prebuilt path is what CI exercises most of the
# time. The builder-fromsource stage above already does this; mirror it here.
ARG AMDGPU_TARGETS
ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}
ARG TARGETARCH
ARG TARGETVARIANT
# The base-grpc-* image installs gRPC to /opt/grpc but doesn't copy it to
# /usr/local. Mirror what the from-source path does so the compile step
# can find gRPC at the canonical prefix the Makefile expects.
RUN cp -a /opt/grpc/. /usr/local/
COPY . /LocalAI
RUN --mount=type=bind,source=.docker/turboquant-compile.sh,target=/usr/local/sbin/compile.sh \
--mount=type=cache,target=/root/.ccache,id=turboquant-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
bash /usr/local/sbin/compile.sh
RUN make -BC /LocalAI/backend/cpp/turboquant package
# ============================================================================
# Final stage — copies package output from one of the two builders.
# BUILDER_TARGET selects which one. BuildKit prunes the unreferenced builder.
#
# BuildKit doesn't support variable expansion in `COPY --from=` directly,
# so we resolve the ARG by aliasing the chosen builder to a fixed stage
# name via `FROM ${BUILDER_TARGET} AS builder` and then COPY --from=builder.
# BUILDER_TARGET itself is declared as a global ARG at the top of this
# file (required for use in FROM), so we just re-import it into this
# stage's scope before the FROM directive.
# ============================================================================
FROM ${BUILDER_TARGET} AS builder
FROM scratch

View File

@@ -24,6 +24,11 @@ service Backend {
rpc TokenizeString(PredictOptions) returns (TokenizationResponse) {}
rpc Status(HealthMessage) returns (StatusResponse) {}
rpc Detect(DetectOptions) returns (DetectResponse) {}
rpc FaceVerify(FaceVerifyRequest) returns (FaceVerifyResponse) {}
rpc FaceAnalyze(FaceAnalyzeRequest) returns (FaceAnalyzeResponse) {}
rpc VoiceVerify(VoiceVerifyRequest) returns (VoiceVerifyResponse) {}
rpc VoiceAnalyze(VoiceAnalyzeRequest) returns (VoiceAnalyzeResponse) {}
rpc VoiceEmbed(VoiceEmbedRequest) returns (VoiceEmbedResponse) {}
rpc StoresSet(StoresSetOptions) returns (Result) {}
rpc StoresDelete(StoresDeleteOptions) returns (Result) {}
@@ -36,9 +41,19 @@ service Backend {
rpc VAD(VADRequest) returns (VADResponse) {}
rpc Diarize(DiarizeRequest) returns (DiarizeResponse) {}
rpc AudioEncode(AudioEncodeRequest) returns (AudioEncodeResult) {}
rpc AudioDecode(AudioDecodeRequest) returns (AudioDecodeResult) {}
rpc AudioTransform(AudioTransformRequest) returns (AudioTransformResult) {}
rpc AudioTransformStream(stream AudioTransformFrameRequest) returns (stream AudioTransformFrameResponse) {}
// AudioToAudioStream is the bidirectional any-to-any S2S RPC. Backends
// that load a speech-to-speech model consume input audio frames and emit
// interleaved audio + transcript + tool-call deltas as typed events.
// Backends without S2S support return UNIMPLEMENTED.
rpc AudioToAudioStream(stream AudioToAudioRequest) returns (stream AudioToAudioResponse) {}
rpc ModelMetadata(ModelOptions) returns (ModelMetadataResponse) {}
// Fine-tuning RPCs
@@ -305,6 +320,11 @@ message ModelOptions {
bool Reranking = 71;
repeated string Overrides = 72;
// EngineArgs carries a JSON-encoded map of backend-native engine arguments
// applied verbatim to the backend's engine constructor (e.g. vLLM AsyncEngineArgs).
// Unknown keys produce an error at LoadModel time.
string EngineArgs = 73;
}
message Result {
@@ -340,6 +360,12 @@ message TranscriptStreamResponse {
TranscriptResult final_result = 2;
}
message TranscriptWord {
int64 start = 1;
int64 end = 2;
string text = 3;
}
message TranscriptSegment {
int32 id = 1;
int64 start = 2;
@@ -347,6 +373,7 @@ message TranscriptSegment {
string text = 4;
repeated int32 tokens = 5;
string speaker = 6;
repeated TranscriptWord words = 7;
}
message GenerateImageRequest {
@@ -403,6 +430,43 @@ message VADResponse {
repeated VADSegment segments = 1;
}
// --- Speaker diarization messages ---
//
// Pure speaker diarization: "who spoke when". Returns time-stamped segments
// labelled with cluster IDs (the same string for the same speaker across
// segments). Some backends (e.g. vibevoice.cpp) produce diarization as a
// by-product of ASR and may also fill in `text` per segment; backends with a
// dedicated diarization pipeline (e.g. sherpa-onnx pyannote) leave `text`
// empty and emit only the segmentation.
message DiarizeRequest {
string dst = 1; // path to audio file (HTTP layer materialises uploads to a temp file)
uint32 threads = 2;
string language = 3; // optional; only meaningful for transcription-bundling backends
int32 num_speakers = 4; // exact speaker count if known (>0 forces); 0 = auto
int32 min_speakers = 5; // hint when auto-detecting; 0 = unset
int32 max_speakers = 6; // hint when auto-detecting; 0 = unset
float clustering_threshold = 7; // distance threshold when num_speakers unknown; 0 = backend default
float min_duration_on = 8; // discard segments shorter than this (seconds); 0 = backend default
float min_duration_off = 9; // merge gaps shorter than this (seconds); 0 = backend default
bool include_text = 10; // when the backend can emit per-segment transcript for free, ask it to populate `text`
}
message DiarizeSegment {
int32 id = 1;
float start = 2; // seconds
float end = 3; // seconds
string speaker = 4; // backend-emitted speaker label (e.g. "0", "SPEAKER_00")
string text = 5; // optional per-segment transcript (empty unless include_text and supported)
}
message DiarizeResponse {
repeated DiarizeSegment segments = 1;
int32 num_speakers = 2; // count of distinct speaker labels in `segments`
float duration = 3; // total audio duration in seconds (0 if unknown)
string language = 4; // optional, when the backend bundles transcription
}
message SoundGenerationRequest {
string text = 1;
string model = 2;
@@ -475,6 +539,112 @@ message DetectResponse {
repeated Detection Detections = 1;
}
// --- Face recognition messages ---
message FacialArea {
float x = 1;
float y = 2;
float w = 3;
float h = 4;
}
message FaceVerifyRequest {
string img1 = 1; // base64-encoded image
string img2 = 2; // base64-encoded image
float threshold = 3; // cosine-distance threshold; 0 = use backend default
bool anti_spoofing = 4; // run MiniFASNet liveness on each image; failed liveness forces verified=false
}
message FaceVerifyResponse {
bool verified = 1;
float distance = 2; // 1 - cosine_similarity
float threshold = 3;
float confidence = 4; // 0-100
string model = 5; // e.g. "buffalo_l"
FacialArea img1_area = 6;
FacialArea img2_area = 7;
float processing_time_ms = 8;
bool img1_is_real = 9; // anti-spoofing result when enabled
float img1_antispoof_score = 10;
bool img2_is_real = 11;
float img2_antispoof_score = 12;
}
message FaceAnalyzeRequest {
string img = 1; // base64-encoded image
repeated string actions = 2; // subset of ["age","gender","emotion","race"]; empty = all-supported
bool anti_spoofing = 3;
}
message FaceAnalysis {
FacialArea region = 1;
float face_confidence = 2;
float age = 3;
string dominant_gender = 4; // "Man" | "Woman"
map<string, float> gender = 5;
string dominant_emotion = 6; // reserved; empty in MVP
map<string, float> emotion = 7;
string dominant_race = 8; // not populated
map<string, float> race = 9;
bool is_real = 10; // anti-spoofing result when enabled
float antispoof_score = 11;
}
message FaceAnalyzeResponse {
repeated FaceAnalysis faces = 1;
}
// --- Voice (speaker) recognition messages ---
//
// Analogous to the Face* messages above, but for speaker biometrics.
// Audio fields accept a filesystem path (same convention as
// TranscriptRequest.dst). The HTTP layer materialises base64 / URL /
// data-URI inputs to a temp file before calling the gRPC backend.
message VoiceVerifyRequest {
string audio1 = 1; // path to first audio clip
string audio2 = 2; // path to second audio clip
float threshold = 3; // cosine-distance threshold; 0 = use backend default
bool anti_spoofing = 4; // reserved for future AASIST bolt-on
}
message VoiceVerifyResponse {
bool verified = 1;
float distance = 2; // 1 - cosine_similarity
float threshold = 3;
float confidence = 4; // 0-100
string model = 5; // e.g. "speechbrain/spkrec-ecapa-voxceleb"
float processing_time_ms = 6;
}
message VoiceAnalyzeRequest {
string audio = 1; // path to audio clip
repeated string actions = 2; // subset of ["age","gender","emotion"]; empty = all-supported
}
message VoiceAnalysis {
float start = 1; // segment start time in seconds (0 if single-utterance)
float end = 2; // segment end time in seconds
float age = 3;
string dominant_gender = 4;
map<string, float> gender = 5;
string dominant_emotion = 6;
map<string, float> emotion = 7;
}
message VoiceAnalyzeResponse {
repeated VoiceAnalysis segments = 1;
}
message VoiceEmbedRequest {
string audio = 1; // path to audio clip
}
message VoiceEmbedResponse {
repeated float embedding = 1;
string model = 2;
}
message ToolFormatMarkers {
string format_type = 1; // "json_native", "tag_with_json", "tag_with_tagged"
@@ -553,6 +723,143 @@ message AudioDecodeResult {
int32 samples_per_frame = 3;
}
// Generic audio transform: an audio-in, audio-out operation, optionally
// conditioned on a second reference signal. Concrete transforms include
// AEC + noise suppression + dereverberation (LocalVQE), voice conversion
// (reference = target speaker), pitch shifting, etc.
message AudioTransformRequest {
string audio_path = 1; // required, primary input file path
string reference_path = 2; // optional auxiliary; empty => zero-fill
string dst = 3; // required, output file path
map<string, string> params = 4; // backend-specific tuning
}
message AudioTransformResult {
string dst = 1;
int32 sample_rate = 2;
int32 samples = 3;
bool reference_provided = 4;
}
// Bidirectional streaming audio transform. The first message MUST carry a
// Config; subsequent messages carry Frames. A second Config mid-stream
// resets streaming state before the next frame.
message AudioTransformFrameRequest {
oneof payload {
AudioTransformStreamConfig config = 1;
AudioTransformFrame frame = 2;
}
}
message AudioTransformStreamConfig {
enum SampleFormat {
F32_LE = 0;
S16_LE = 1;
}
SampleFormat sample_format = 1;
int32 sample_rate = 2; // 0 => backend default
int32 frame_samples = 3; // 0 => backend default
map<string, string> params = 4;
bool reset = 5; // reset streaming state before next frame
}
message AudioTransformFrame {
bytes audio_pcm = 1; // frame_samples samples in stream's format
bytes reference_pcm = 2; // empty => zero-fill (silent reference)
}
message AudioTransformFrameResponse {
bytes pcm = 1;
int64 frame_index = 2;
}
// === AudioToAudioStream messages =========================================
//
// Bidirectional stream between the LocalAI core and an any-to-any audio
// model. The client opens the stream with a Config payload, then alternates
// Frame (input audio) and Control (turn boundaries, function-call results,
// session updates) payloads. The server streams back typed events: audio
// frames carry PCM in `pcm`; transcript / tool-call deltas carry JSON in
// `meta`; the stream ends with a `response.done` (success) or `error` event.
message AudioToAudioRequest {
oneof payload {
AudioToAudioConfig config = 1;
AudioToAudioFrame frame = 2;
AudioToAudioControl control = 3;
}
}
message AudioToAudioConfig {
// PCM format for client→server audio. 0 => backend default
// (16 kHz for the LFM2-Audio Conformer encoder).
int32 input_sample_rate = 1;
// Preferred server→client audio rate. 0 => backend default
// (24 kHz for the LFM2-Audio vocoder).
int32 output_sample_rate = 2;
// Optional system prompt override. Empty => backend chooses based on
// mode (e.g. "Respond with interleaved text and audio.").
string system_prompt = 3;
// Optional baked-voice id. Models that only ship a fixed set of
// voices (e.g. LFM2-Audio: us_male/us_female/uk_male/uk_female) match
// this against their voice table; an empty string keeps the default.
string voice = 4;
// JSON-encoded array of tool definitions in OpenAI Chat Completions
// format. Empty => no tools.
string tools = 5;
// Free-form sampling / decoding parameters (temperature, top_k,
// max_new_tokens, audio_top_k, etc).
map<string, string> params = 6;
// True => reset any session-scoped state before processing further
// frames on this stream. The first Config implicitly resets.
bool reset = 7;
}
message AudioToAudioFrame {
// Raw PCM s16le mono at config.input_sample_rate. Empty pcm + end_of_input
// is a valid "user finished speaking" marker without trailing audio.
bytes pcm = 1;
// Marks the last frame of a user turn. The backend may begin emitting
// a response immediately after seeing this.
bool end_of_input = 2;
}
message AudioToAudioControl {
// Free-form control event names. Initial set:
// "input_audio_buffer.commit" — user finished speaking
// "response.cancel" — abort in-flight generation
// "conversation.item.create" — inject a non-audio item (e.g.
// function_call_output as JSON in
// `payload`)
// "session.update" — re-configure mid-stream
string event = 1;
// Event-specific JSON payload.
bytes payload = 2;
}
message AudioToAudioResponse {
// Event identifies what this frame carries. Mirrors the OpenAI Realtime
// API server-event names where applicable. Initial set:
// "response.audio.delta"
// "response.audio_transcript.delta"
// "response.function_call_arguments.delta"
// "response.function_call_arguments.done"
// "response.done"
// "error"
string event = 1;
// Populated when event = response.audio.delta.
bytes pcm = 2;
// Populated alongside pcm to identify its rate. 0 => same as the
// session's negotiated output_sample_rate.
int32 sample_rate = 3;
// JSON payload for non-PCM events (transcript chunk, tool args, error
// body).
bytes meta = 4;
// Monotonic per-stream counter, useful for client reordering and
// debugging.
int64 sequence = 5;
}
message ModelMetadataResponse {
bool supports_thinking = 1;
string rendered_template = 2; // The rendered chat template with enable_thinking=true (empty if not applicable)

9
backend/cpp/ds4/.gitignore vendored Normal file
View File

@@ -0,0 +1,9 @@
ds4/
build/
package/
grpc-server
*.o
backend.pb.cc
backend.pb.h
backend.grpc.pb.cc
backend.grpc.pb.h

View File

@@ -0,0 +1,101 @@
cmake_minimum_required(VERSION 3.15)
project(ds4-grpc-server LANGUAGES CXX C)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(TARGET grpc-server)
option(DS4_NATIVE "Compile with -march=native / -mcpu=native" ON)
set(DS4_GPU "cpu" CACHE STRING "GPU backend: cpu, cuda, or metal")
set(DS4_DIR "${CMAKE_CURRENT_SOURCE_DIR}/ds4" CACHE PATH "Path to cloned ds4 source")
find_package(Threads REQUIRED)
find_package(Protobuf CONFIG QUIET)
if(NOT Protobuf_FOUND)
find_package(Protobuf REQUIRED)
endif()
find_package(gRPC CONFIG QUIET)
if(NOT gRPC_FOUND)
# Ubuntu's apt-installed grpc++ does not ship a CMake config - fall back.
find_library(GRPCPP_LIB grpc++ REQUIRED)
find_library(GRPCPP_REFLECTION_LIB grpc++_reflection REQUIRED)
add_library(gRPC::grpc++ INTERFACE IMPORTED)
set_target_properties(gRPC::grpc++ PROPERTIES INTERFACE_LINK_LIBRARIES "${GRPCPP_LIB}")
add_library(gRPC::grpc++_reflection INTERFACE IMPORTED)
set_target_properties(gRPC::grpc++_reflection PROPERTIES INTERFACE_LINK_LIBRARIES "${GRPCPP_REFLECTION_LIB}")
endif()
find_program(_PROTOC NAMES protoc REQUIRED)
find_program(_GRPC_CPP_PLUGIN NAMES grpc_cpp_plugin REQUIRED)
get_filename_component(HW_PROTO "${CMAKE_CURRENT_SOURCE_DIR}/../../backend.proto" ABSOLUTE)
get_filename_component(HW_PROTO_PATH "${HW_PROTO}" PATH)
set(HW_PROTO_SRCS "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.cc")
set(HW_PROTO_HDRS "${CMAKE_CURRENT_BINARY_DIR}/backend.pb.h")
set(HW_GRPC_SRCS "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.cc")
set(HW_GRPC_HDRS "${CMAKE_CURRENT_BINARY_DIR}/backend.grpc.pb.h")
add_custom_command(
OUTPUT "${HW_PROTO_SRCS}" "${HW_PROTO_HDRS}" "${HW_GRPC_SRCS}" "${HW_GRPC_HDRS}"
COMMAND ${_PROTOC}
ARGS --grpc_out "${CMAKE_CURRENT_BINARY_DIR}"
--cpp_out "${CMAKE_CURRENT_BINARY_DIR}"
-I "${HW_PROTO_PATH}"
--plugin=protoc-gen-grpc="${_GRPC_CPP_PLUGIN}"
"${HW_PROTO}"
DEPENDS "${HW_PROTO}")
add_library(hw_grpc_proto STATIC
${HW_GRPC_SRCS} ${HW_GRPC_HDRS}
${HW_PROTO_SRCS} ${HW_PROTO_HDRS})
target_include_directories(hw_grpc_proto PUBLIC ${CMAKE_CURRENT_BINARY_DIR})
set(DS4_OBJS "${DS4_DIR}/ds4.o")
if(DS4_GPU STREQUAL "cuda")
list(APPEND DS4_OBJS "${DS4_DIR}/ds4_cuda.o")
elseif(DS4_GPU STREQUAL "metal")
list(APPEND DS4_OBJS "${DS4_DIR}/ds4_metal.o")
elseif(DS4_GPU STREQUAL "cpu")
set(DS4_OBJS "${DS4_DIR}/ds4_cpu.o")
endif()
add_executable(${TARGET}
grpc-server.cpp
dsml_parser.cpp
dsml_renderer.cpp
kv_cache.cpp)
target_include_directories(${TARGET} PRIVATE ${DS4_DIR})
foreach(obj ${DS4_OBJS})
target_sources(${TARGET} PRIVATE ${obj})
set_source_files_properties(${obj} PROPERTIES EXTERNAL_OBJECT TRUE GENERATED TRUE)
endforeach()
target_link_libraries(${TARGET} PRIVATE
hw_grpc_proto
gRPC::grpc++
gRPC::grpc++_reflection
protobuf::libprotobuf
Threads::Threads
m)
if(DS4_GPU STREQUAL "cuda")
find_package(CUDAToolkit REQUIRED)
target_link_libraries(${TARGET} PRIVATE CUDA::cudart CUDA::cublas)
elseif(DS4_GPU STREQUAL "metal")
find_library(FOUNDATION_LIB Foundation REQUIRED)
find_library(METAL_LIB Metal REQUIRED)
target_link_libraries(${TARGET} PRIVATE ${FOUNDATION_LIB} ${METAL_LIB})
elseif(DS4_GPU STREQUAL "cpu")
target_compile_definitions(${TARGET} PRIVATE DS4_NO_GPU)
endif()
if(DS4_NATIVE)
if(APPLE)
target_compile_options(${TARGET} PRIVATE -mcpu=native)
else()
target_compile_options(${TARGET} PRIVATE -march=native)
endif()
endif()

78
backend/cpp/ds4/Makefile Normal file
View File

@@ -0,0 +1,78 @@
# ds4 backend Makefile.
#
# Upstream pin lives below as DS4_VERSION?=c9dd9499bfa57c1bbfbb4446eff963330ab5329b
# (.github/bump_deps.sh) can find and update it - matches the
# llama-cpp / ik-llama-cpp / turboquant convention.
DS4_VERSION?=c9dd9499bfa57c1bbfbb4446eff963330ab5329b
DS4_REPO?=https://github.com/antirez/ds4
CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
BUILD_DIR := build
BUILD_TYPE ?=
NATIVE ?= false
JOBS ?= $(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
UNAME_S := $(shell uname -s)
CMAKE_ARGS ?= -DCMAKE_BUILD_TYPE=Release
ifeq ($(BUILD_TYPE),cublas)
CMAKE_ARGS += -DDS4_GPU=cuda
DS4_OBJ_TARGET := ds4.o ds4_cuda.o
else ifeq ($(UNAME_S),Darwin)
CMAKE_ARGS += -DDS4_GPU=metal
DS4_OBJ_TARGET := ds4.o ds4_metal.o
else
# CPU reference path (Linux only - macOS CPU path is broken by VM bug per ds4 README).
CMAKE_ARGS += -DDS4_GPU=cpu
DS4_OBJ_TARGET := ds4_cpu.o
endif
ifneq ($(NATIVE),true)
CMAKE_ARGS += -DDS4_NATIVE=OFF
endif
.PHONY: grpc-server package clean purge test all
all: grpc-server
# Clone the upstream ds4 source at the pinned commit. Directory acts as the
# target so make only re-clones when missing. After a DS4_VERSION bump,
# run 'make purge && make' to refetch (or rely on CI's clean build).
ds4:
mkdir -p ds4
cd ds4 && \
git init -q && \
git remote add origin $(DS4_REPO) && \
git fetch --depth 1 origin $(DS4_VERSION) && \
git checkout FETCH_HEAD
# Build ds4's engine object files via its own Makefile, which already encodes
# the right per-platform compile flags (Objective-C/Metal on Darwin, nvcc on Linux+CUDA).
ds4/ds4.o: ds4
ifeq ($(BUILD_TYPE),cublas)
+$(MAKE) -C ds4 ds4.o ds4_cuda.o
else ifeq ($(UNAME_S),Darwin)
+$(MAKE) -C ds4 ds4.o ds4_metal.o
else
+$(MAKE) -C ds4 ds4_cpu.o
endif
grpc-server: ds4/ds4.o
mkdir -p $(BUILD_DIR)
cd $(BUILD_DIR) && cmake $(CMAKE_ARGS) $(CURRENT_MAKEFILE_DIR) && cmake --build . --config Release -j $(JOBS)
cp $(BUILD_DIR)/grpc-server grpc-server
package: grpc-server
bash package.sh
test:
@echo "ds4 backend: e2e coverage at tests/e2e-backends/ (BACKEND_BINARY mode)"
clean:
rm -rf $(BUILD_DIR) grpc-server package
if [ -d ds4 ]; then $(MAKE) -C ds4 clean; fi
purge: clean
rm -rf ds4

View File

@@ -0,0 +1,359 @@
#include "dsml_parser.h"
#include <algorithm>
#include <cstdio>
#include <cstring>
#include <chrono>
#include <random>
#include <string>
#include <vector>
namespace ds4cpp {
namespace {
constexpr const char *kThinkOpen = "<think>";
constexpr const char *kThinkClose = "</think>";
constexpr const char *kToolsOpen = "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>"; // <DSMLtool_calls>
constexpr const char *kToolsClose = "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>"; // </DSMLtool_calls>
constexpr const char *kInvokeOpenPfx = "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke name=\""; // <DSMLinvoke name="
constexpr const char *kInvokeClose = "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke>"; // </DSMLinvoke>
constexpr const char *kParamOpenPfx = "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter name=\""; // <DSMLparameter name="
constexpr const char *kParamClose = "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>"; // </DSMLparameter>
// All structural markers the parser might encounter - used to detect "buf
// might be a partial marker, don't drain yet" conditions.
const std::vector<std::string> &all_markers() {
static const std::vector<std::string> v = {
kThinkOpen, kThinkClose,
kToolsOpen, kToolsClose,
kInvokeOpenPfx, kInvokeClose,
kParamOpenPfx, kParamClose,
};
return v;
}
// Returns true if `buf` could be a *prefix* of any marker (i.e., we should
// wait for more text before draining as plain content). The marker-prefix
// loop handles fixed markers exactly. For markers with variable-length
// internal data (kInvokeOpenPfx, kParamOpenPfx have an open quote, then the
// tool/param name, then a closing quote and `>`), we also wait while buf
// starts with `<` and has not yet seen a `>`: the leading `<` could be the
// start of one of those open markers, or a literal that we can confirm only
// once we know what follows. Anything after the first `>` arrives is either
// consumed by TryConsumeMarker or emitted as a literal `<` by the caller.
bool looks_like_prefix(const std::string &buf) {
for (const auto &m : all_markers()) {
if (m.size() > buf.size() && m.compare(0, buf.size(), buf) == 0) return true;
}
if (!buf.empty() && buf[0] == '<' && buf.find('>') == std::string::npos) {
return true;
}
return false;
}
bool consume_literal(std::string &buf, const std::string &lit) {
if (buf.compare(0, lit.size(), lit) == 0) {
buf.erase(0, lit.size());
return true;
}
return false;
}
// Find the next '<' in buf starting at offset; returns std::string::npos if none.
size_t next_tag(const std::string &buf, size_t off = 0) {
return buf.find('<', off);
}
std::string json_escape(const std::string &in) {
std::string out;
out.reserve(in.size() + 2);
for (char c : in) {
switch (c) {
case '"': out += "\\\""; break;
case '\\': out += "\\\\"; break;
case '\b': out += "\\b"; break;
case '\f': out += "\\f"; break;
case '\n': out += "\\n"; break;
case '\r': out += "\\r"; break;
case '\t': out += "\\t"; break;
default:
if (static_cast<unsigned char>(c) < 0x20) {
char tmp[8];
std::snprintf(tmp, sizeof(tmp), "\\u%04x", c);
out += tmp;
} else {
out += c;
}
}
}
return out;
}
} // namespace
DsmlParser::DsmlParser() = default;
bool DsmlParser::IsInDsmlStructural() const {
switch (state_) {
case State::TOOL_CALLS:
case State::INVOKE:
return true;
case State::PARAM_VALUE: // payload bytes; user sampling applies
case State::TEXT:
case State::THINK:
return false;
}
return false;
}
void DsmlParser::EmitArgsChunk(const std::string &chunk, std::vector<ParserEvent> &out) {
if (chunk.empty()) return;
ParserEvent e;
e.type = ParserEvent::TOOL_ARGS;
e.text = chunk;
e.index = tool_index_;
out.push_back(std::move(e));
}
void DsmlParser::FinishCurrentToolCall(std::vector<ParserEvent> &out) {
if (tool_index_ < 0) return;
// Close the JSON object that was opened on the first parameter.
if (args_emitted_open_brace_) {
EmitArgsChunk("}", out);
} else {
EmitArgsChunk("{}", out);
}
ParserEvent e;
e.type = ParserEvent::TOOL_END;
e.index = tool_index_;
out.push_back(std::move(e));
current_tool_name_.clear();
args_emitted_open_brace_ = false;
args_param_count_ = 0;
}
bool DsmlParser::TryConsumeMarker(std::vector<ParserEvent> &out) {
switch (state_) {
case State::TEXT: {
if (consume_literal(buf_, kThinkOpen)) { state_ = State::THINK; return true; }
if (consume_literal(buf_, kToolsOpen)) { state_ = State::TOOL_CALLS; return true; }
return false;
}
case State::THINK: {
if (consume_literal(buf_, kThinkClose)) { state_ = State::TEXT; return true; }
return false;
}
case State::TOOL_CALLS: {
if (consume_literal(buf_, kToolsClose)) { state_ = State::TEXT; return true; }
// <DSMLinvoke name="X">
if (buf_.compare(0, std::strlen(kInvokeOpenPfx), kInvokeOpenPfx) == 0) {
size_t close_q = buf_.find('"', std::strlen(kInvokeOpenPfx));
if (close_q == std::string::npos) return false; // need more bytes
size_t close_gt = buf_.find('>', close_q);
if (close_gt == std::string::npos) return false;
current_tool_name_ = buf_.substr(std::strlen(kInvokeOpenPfx),
close_q - std::strlen(kInvokeOpenPfx));
tool_index_++;
buf_.erase(0, close_gt + 1);
ParserEvent e;
e.type = ParserEvent::TOOL_START;
e.tool_name = current_tool_name_;
e.tool_id = RandomToolId();
e.index = tool_index_;
out.push_back(std::move(e));
args_emitted_open_brace_ = false;
args_param_count_ = 0;
state_ = State::INVOKE;
return true;
}
return false;
}
case State::INVOKE: {
if (consume_literal(buf_, kInvokeClose)) {
FinishCurrentToolCall(out);
state_ = State::TOOL_CALLS;
return true;
}
// <DSMLparameter name="K" string="true|false">
if (buf_.compare(0, std::strlen(kParamOpenPfx), kParamOpenPfx) == 0) {
size_t close_q = buf_.find('"', std::strlen(kParamOpenPfx));
if (close_q == std::string::npos) return false;
size_t string_attr = buf_.find("string=\"", close_q);
if (string_attr == std::string::npos) return false;
size_t string_q = buf_.find('"', string_attr + 8);
if (string_q == std::string::npos) return false;
size_t close_gt = buf_.find('>', string_q);
if (close_gt == std::string::npos) return false;
param_name_ = buf_.substr(std::strlen(kParamOpenPfx),
close_q - std::strlen(kParamOpenPfx));
std::string string_val = buf_.substr(string_attr + 8,
string_q - (string_attr + 8));
param_is_string_ = (string_val == "true");
param_value_.clear();
buf_.erase(0, close_gt + 1);
// Emit args JSON opener / separator.
std::string opener;
if (!args_emitted_open_brace_) { opener = "{"; args_emitted_open_brace_ = true; }
else { opener = ","; }
opener += "\"" + json_escape(param_name_) + "\":";
if (param_is_string_) opener += "\"";
EmitArgsChunk(opener, out);
args_param_count_++;
state_ = State::PARAM_VALUE;
return true;
}
return false;
}
case State::PARAM_VALUE: {
if (consume_literal(buf_, kParamClose)) {
if (param_is_string_) EmitArgsChunk("\"", out);
state_ = State::INVOKE;
return true;
}
return false;
}
}
return false;
}
void DsmlParser::DrainPlain(std::vector<ParserEvent> &out) {
// Drain everything up to the next '<' that *might* start a marker.
// Anything before the next '<' is safe to emit; the '<...' tail stays buffered.
while (!buf_.empty()) {
size_t lt = next_tag(buf_, 0);
if (lt == std::string::npos) {
// No tag at all - emit (or accumulate) the whole buffer.
ParserEvent e;
if (state_ == State::PARAM_VALUE) {
std::string esc = param_is_string_ ? json_escape(buf_) : buf_;
EmitArgsChunk(esc, out);
} else if (state_ == State::THINK) {
e.type = ParserEvent::REASONING;
e.text = buf_;
out.push_back(std::move(e));
} else if (state_ == State::TEXT) {
e.type = ParserEvent::CONTENT;
e.text = buf_;
out.push_back(std::move(e));
}
// Inside INVOKE / TOOL_CALLS with no marker, raw bytes are
// structural whitespace - discard.
buf_.clear();
return;
}
if (lt > 0) {
std::string chunk = buf_.substr(0, lt);
buf_.erase(0, lt);
ParserEvent e;
if (state_ == State::PARAM_VALUE) {
std::string esc = param_is_string_ ? json_escape(chunk) : chunk;
EmitArgsChunk(esc, out);
} else if (state_ == State::THINK) {
e.type = ParserEvent::REASONING;
e.text = chunk;
out.push_back(std::move(e));
} else if (state_ == State::TEXT) {
e.type = ParserEvent::CONTENT;
e.text = chunk;
out.push_back(std::move(e));
}
}
// buf_[0] == '<' - try consuming a marker. If we consumed one, loop again.
if (!TryConsumeMarker(out)) {
// Could be a partial marker - wait for more bytes.
if (looks_like_prefix(buf_)) return;
// Otherwise this '<' is a literal - emit one char and continue.
std::string one(1, buf_[0]);
buf_.erase(0, 1);
ParserEvent e;
if (state_ == State::PARAM_VALUE) {
std::string esc = param_is_string_ ? json_escape(one) : one;
EmitArgsChunk(esc, out);
} else if (state_ == State::THINK) {
e.type = ParserEvent::REASONING;
e.text = one;
out.push_back(std::move(e));
} else if (state_ == State::TEXT) {
e.type = ParserEvent::CONTENT;
e.text = one;
out.push_back(std::move(e));
}
}
}
}
void DsmlParser::Feed(const std::string &chunk, std::vector<ParserEvent> &out) {
buf_ += chunk;
DrainPlain(out);
}
void DsmlParser::Flush(std::vector<ParserEvent> &out) {
// At flush time we no longer wait for marker completion - drain everything
// (the trailing bytes won't grow). Mirror DrainPlain's state-aware
// classification: PARAM_VALUE bytes become TOOL_ARGS, THINK bytes become
// REASONING, TEXT bytes become CONTENT, and INVOKE/TOOL_CALLS bytes are
// structural whitespace (discarded).
auto emit_plain = [&](const std::string &chunk) {
if (chunk.empty()) return;
if (state_ == State::PARAM_VALUE) {
std::string esc = param_is_string_ ? json_escape(chunk) : chunk;
EmitArgsChunk(esc, out);
return;
}
if (state_ == State::THINK) {
ParserEvent e;
e.type = ParserEvent::REASONING;
e.text = chunk;
out.push_back(std::move(e));
return;
}
if (state_ == State::TEXT) {
ParserEvent e;
e.type = ParserEvent::CONTENT;
e.text = chunk;
out.push_back(std::move(e));
return;
}
// INVOKE / TOOL_CALLS: structural whitespace, discard.
};
while (!buf_.empty()) {
size_t lt = next_tag(buf_, 0);
if (lt == std::string::npos) {
emit_plain(buf_);
buf_.clear();
return;
}
if (lt > 0) {
std::string chunk = buf_.substr(0, lt);
buf_.erase(0, lt);
emit_plain(chunk);
}
if (!TryConsumeMarker(out)) {
// Definitely a literal '<' now (no chance of more bytes arriving).
std::string one(1, buf_[0]);
buf_.erase(0, 1);
emit_plain(one);
}
}
// If we ended mid-tool-call (model truncated), close it cleanly.
if (state_ == State::INVOKE || state_ == State::PARAM_VALUE) {
if (state_ == State::PARAM_VALUE && param_is_string_) EmitArgsChunk("\"", out);
FinishCurrentToolCall(out);
state_ = State::TEXT;
}
}
std::string RandomToolId() {
static thread_local std::mt19937_64 rng{
static_cast<uint64_t>(std::chrono::system_clock::now().time_since_epoch().count())};
const char *alphabet =
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
std::string out = "call_";
for (int i = 0; i < 16; ++i) {
out += alphabet[rng() % 62];
}
return out;
}
} // namespace ds4cpp

View File

@@ -0,0 +1,77 @@
#pragma once
#include <functional>
#include <string>
#include <vector>
namespace ds4cpp {
struct ParserEvent {
enum Type { CONTENT, REASONING, TOOL_START, TOOL_ARGS, TOOL_END };
Type type;
std::string text; // CONTENT, REASONING, TOOL_ARGS
std::string tool_name; // TOOL_START
std::string tool_id; // TOOL_START (caller-assigned)
int index = 0; // TOOL_START / TOOL_ARGS / TOOL_END
};
// Streaming parser. Stateless across instances; one per Predict call.
class DsmlParser {
public:
DsmlParser();
// Feed a chunk of raw model-emitted text. Appends classified events to
// `out`. May buffer the tail of `chunk` internally if it looks like a
// marker prefix.
void Feed(const std::string &chunk, std::vector<ParserEvent> &out);
// Flush any remaining buffered text as CONTENT (called at generation end).
void Flush(std::vector<ParserEvent> &out);
// True when the parser is inside a DSML structural position - that is,
// tags/markers between tool-call boundaries where the model is expected
// to emit protocol bytes verbatim. Mirrors ds4_server.c's "force
// temperature=0 unless dsml_decode_state_uses_payload_sampling" rule:
//
// TEXT / THINK -> false (user sampling applies)
// PARAM_VALUE -> false (payload uses user sampling)
// TOOL_CALLS / INVOKE -> true (structural; force greedy)
//
// Callers should use this BEFORE the next sample() call to pick the
// effective temperature; the parser's state reflects what's already
// been consumed, so it predicts the next token's classification.
bool IsInDsmlStructural() const;
private:
enum class State { TEXT, THINK, TOOL_CALLS, INVOKE, PARAM_VALUE };
State state_ = State::TEXT;
std::string buf_;
std::string current_tool_name_;
int tool_index_ = -1;
// While parsing a parameter value:
std::string param_name_;
bool param_is_string_ = true;
std::string param_value_;
// Incrementally-built arguments JSON for the active tool call.
std::string args_json_so_far_;
bool args_emitted_open_brace_ = false;
int args_param_count_ = 0;
// Try to consume one structural marker starting at buf_[0]. Returns true
// and advances state if a complete marker was consumed; false if the
// buffer is ambiguous (could be a marker prefix).
bool TryConsumeMarker(std::vector<ParserEvent> &out);
// Drain plain text from buf_ as far as we're sure it's not a marker prefix.
// Emits CONTENT or REASONING depending on current state.
void DrainPlain(std::vector<ParserEvent> &out);
// Emit the next chunk of arguments JSON to the consumer.
void EmitArgsChunk(const std::string &chunk, std::vector<ParserEvent> &out);
void FinishCurrentToolCall(std::vector<ParserEvent> &out);
};
// Generate a random tool call ID (e.g. "call_AbCdEf"). Used by the gRPC layer
// when assigning IDs to streamed tool calls.
std::string RandomToolId();
} // namespace ds4cpp

View File

@@ -0,0 +1,140 @@
#include "dsml_renderer.h"
// We accept either nlohmann::json (if available) or fall back to a tiny
// hand-rolled parser. The LocalAI tree already has nlohmann/json bundled
// in vendor paths; we use the apt-installed nlohmann-json3-dev (installed
// in Task 11 step 1) when present, otherwise the bundled copy.
#if __has_include(<nlohmann/json.hpp>)
#include <nlohmann/json.hpp>
using json = nlohmann::json;
#else
#error "nlohmann/json.hpp not found; install nlohmann-json3-dev"
#endif
#include <sstream>
namespace ds4cpp {
namespace {
void render_param(std::ostringstream &os, const std::string &name,
const json &value) {
bool is_string = value.is_string();
os << "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter name=\"" << name
<< "\" string=\"" << (is_string ? "true" : "false") << "\">";
if (is_string) {
os << value.get<std::string>();
} else {
os << value.dump();
}
os << "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>\n";
}
} // namespace
std::string RenderAssistantToolCalls(const std::string &tool_calls_json) {
if (tool_calls_json.empty()) return "";
json arr;
try {
arr = json::parse(tool_calls_json);
} catch (const std::exception &) {
return "";
}
if (!arr.is_array() || arr.empty()) return "";
std::ostringstream os;
os << "\n\n<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>\n";
for (const auto &call : arr) {
// OpenAI shape: { id, type, function: { name, arguments (JSON string) } }
// Anthropic shape comes through normalized by LocalAI.
std::string name;
std::string args_str;
if (call.contains("function")) {
const auto &fn = call["function"];
if (fn.contains("name") && fn["name"].is_string())
name = fn["name"].get<std::string>();
if (fn.contains("arguments") && fn["arguments"].is_string())
args_str = fn["arguments"].get<std::string>();
}
os << "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke name=\"" << name << "\">\n";
if (!args_str.empty()) {
json args;
try {
args = json::parse(args_str);
} catch (...) {
args = json{};
}
if (args.is_object()) {
for (auto it = args.begin(); it != args.end(); ++it) {
render_param(os, it.key(), it.value());
}
}
}
os << "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke>\n";
}
os << "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>";
return os.str();
}
std::string RenderToolResult(const std::string &tool_call_id, const std::string &content) {
std::ostringstream os;
// ds4_server.c wraps tool results in a "tool_result" DSML tag carrying
// the tool_call_id. Match that shape.
os << "<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_result id=\"" << tool_call_id << "\">"
<< content
<< "</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_result>";
return os.str();
}
std::string RenderToolsManifest(const std::string &tools_json) {
if (tools_json.empty()) return "";
json arr;
try {
arr = json::parse(tools_json);
} catch (const std::exception &) {
return "";
}
if (!arr.is_array() || arr.empty()) return "";
// Extract each OpenAI tool's `function` object, dump as compact JSON, one
// per line. Mirrors openai_function_schema_from_tool() in ds4_server.c.
std::ostringstream schemas;
for (const auto &tool : arr) {
if (tool.contains("function") && tool["function"].is_object()) {
schemas << tool["function"].dump() << "\n";
} else if (tool.is_object()) {
// Anthropic / direct-schema form: pass through.
schemas << tool.dump() << "\n";
}
}
if (schemas.tellp() == std::streampos(0)) return "";
// Verbatim text from ds4_server.c append_tools_prompt_text. Do NOT
// paraphrase - the model was trained on these exact bytes.
std::ostringstream os;
os << "## Tools\n\n"
"You have access to a set of tools to help answer the user question. "
"You can invoke tools by writing a \"<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>\" block like the following:\n\n"
"<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>\n"
"<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke name=\"$TOOL_NAME\">\n"
"<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter name=\"$PARAMETER_NAME\" string=\"true|false\">$PARAMETER_VALUE</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>\n"
"...\n"
"</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke>\n"
"<\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke name=\"$TOOL_NAME2\">\n"
"...\n"
"</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "invoke>\n"
"</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "tool_calls>\n\n"
"String parameters should be specified as raw text and set `string=\"true\"`. "
"Preserve characters such as `>`, `&`, and `&&` exactly; never replace normal string characters with XML or HTML entity escapes. "
"Only if a string value itself contains the exact closing parameter tag `</\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>`, write that tag as `&lt;/\xef\xbd\x9c" "DSML\xef\xbd\x9c" "parameter>` inside the value. "
"For all other types (numbers, booleans, arrays, objects), pass the value in JSON format and set `string=\"false\"`.\n\n"
"If thinking_mode is enabled (triggered by <think>), you MUST output your complete reasoning inside <think>...</think> BEFORE any tool calls or final response.\n\n"
"Otherwise, output directly after </think> with tool calls or final response.\n\n"
"### Available Tool Schemas\n\n"
<< schemas.str()
<< "\nYou MUST strictly follow the above defined tool name and parameter schemas to invoke tool calls. "
"Use the exact parameter names from the schemas.";
return os.str();
}
} // namespace ds4cpp

View File

@@ -0,0 +1,27 @@
#pragma once
#include <string>
namespace ds4cpp {
// Render an assistant message's tool_calls JSON array into the DSML block
// that ds4 expects in its prompt. `tool_calls_json` is the value of
// proto.Message.tool_calls (OpenAI shape: array of {id, type, function:{name, arguments}}).
// Returns the DSML text to append after the assistant's content.
std::string RenderAssistantToolCalls(const std::string &tool_calls_json);
// Render a role="tool" message into the DSML "tool result" block. ds4's
// prompt template expects tool results inside a specific tag; we wrap the
// `content` with that tag and include the `tool_call_id` so the model can
// correlate.
std::string RenderToolResult(const std::string &tool_call_id, const std::string &content);
// Render the "## Tools" manifest that ds4 expects in the SYSTEM prompt when
// tools are available. Without this preamble the model has no idea tools
// exist and will not emit DSML tool calls. Mirrors append_tools_prompt_text()
// in ds4_server.c (~line 1646): a fixed preamble + "### Available Tool
// Schemas" section + one JSON schema per line (extracted from each OpenAI
// tool's .function object) + a fixed closing instruction. Returns empty
// when tools_json is empty / unparseable.
std::string RenderToolsManifest(const std::string &tools_json);
} // namespace ds4cpp

View File

@@ -0,0 +1,696 @@
// ds4 LocalAI gRPC backend.
//
// Wraps antirez/ds4's `ds4_engine_*` / `ds4_session_*` public API
// (see ds4/ds4.h) over LocalAI's backend.proto. Tool calls, thinking
// mode, and disk KV cache are wired in follow-up commits; this commit
// is just the bind/listen/Health/Free skeleton.
#include "backend.pb.h"
#include "backend.grpc.pb.h"
#include "dsml_parser.h" // populated in Task 12
#include "dsml_renderer.h" // populated in Task 16
#include "kv_cache.h" // populated in Task 17
extern "C" {
#include "ds4.h"
}
#include <grpcpp/grpcpp.h>
#include <grpcpp/server.h>
#include <grpcpp/server_builder.h>
#include <grpcpp/ext/proto_server_reflection_plugin.h>
#include <atomic>
#include <chrono>
#include <csignal>
#include <cstring>
#include <iostream>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <vector>
using grpc::Server;
using grpc::ServerBuilder;
using grpc::ServerContext;
using grpc::ServerWriter;
// NOTE: do NOT alias `grpc::Status` as `Status` - the Status RPC method below
// would shadow the type, breaking the other RPC method declarations that use
// it as a return type. Use GStatus instead.
using GStatus = ::grpc::Status;
using grpc::StatusCode;
namespace {
// Global state - ds4 is single-engine-per-process by design.
std::mutex g_engine_mu;
ds4_engine *g_engine = nullptr;
ds4_session *g_session = nullptr;
int g_ctx_size = 32768;
std::string g_kv_cache_dir; // empty disables disk cache
std::atomic<Server *> g_server{nullptr};
// Parse a "key:value" option string. Returns empty when no colon.
static std::pair<std::string, std::string> split_option(const std::string &opt) {
auto colon = opt.find(':');
if (colon == std::string::npos) return {opt, ""};
return {opt.substr(0, colon), opt.substr(colon + 1)};
}
static void append_token_text(ds4_engine *engine, int token, std::string &out) {
size_t len = 0;
const char *text = ds4_token_text(engine, token, &len);
if (text && len > 0) out.append(text, len);
}
struct CollectCtx {
ds4_engine *engine;
std::string raw_buf; // exact raw bytes for Reply.message
ds4cpp::DsmlParser parser;
backend::Reply *reply;
int tokens;
// Per-tool aggregation: accumulate ChatDelta tool_calls so we emit one
// delta with all calls, mirroring how vllm's non-streaming path returns.
struct Pending {
std::string id;
std::string name;
std::string args;
};
std::vector<Pending> pending;
std::string content_buf;
std::string reasoning_buf;
};
static void apply_events(CollectCtx *c, const std::vector<ds4cpp::ParserEvent> &events) {
for (const auto &e : events) {
switch (e.type) {
case ds4cpp::ParserEvent::CONTENT:
c->content_buf += e.text;
break;
case ds4cpp::ParserEvent::REASONING:
c->reasoning_buf += e.text;
break;
case ds4cpp::ParserEvent::TOOL_START:
if ((int)c->pending.size() <= e.index)
c->pending.resize(e.index + 1);
c->pending[e.index].id = e.tool_id;
c->pending[e.index].name = e.tool_name;
break;
case ds4cpp::ParserEvent::TOOL_ARGS:
if ((int)c->pending.size() > e.index)
c->pending[e.index].args += e.text;
break;
case ds4cpp::ParserEvent::TOOL_END:
// No-op for non-streaming: the final delta is emitted at the end.
break;
}
}
}
static void collect_emit(void *ud, int token) {
auto *c = static_cast<CollectCtx *>(ud);
if (token == ds4_token_eos(c->engine)) return;
size_t len = 0;
const char *text = ds4_token_text(c->engine, token, &len);
if (!text || len == 0) return;
std::string chunk(text, len);
c->raw_buf += chunk;
std::vector<ds4cpp::ParserEvent> events;
c->parser.Feed(chunk, events);
apply_events(c, events);
c->tokens++;
}
static void collect_done(void *) {}
struct StreamCtx {
ds4_engine *engine;
ServerWriter<backend::Reply> *writer;
ds4cpp::DsmlParser parser;
int tokens;
bool aborted;
// Track which tool indices we've seen TOOL_START for, so subsequent
// ARGS deltas can elide the redundant id/name fields.
std::vector<bool> tool_started;
};
static void stream_emit(void *ud, int token) {
auto *s = static_cast<StreamCtx *>(ud);
if (s->aborted) return;
if (token == ds4_token_eos(s->engine)) return;
size_t len = 0;
const char *text = ds4_token_text(s->engine, token, &len);
if (!text || len == 0) return;
std::string chunk(text, len);
std::vector<ds4cpp::ParserEvent> events;
s->parser.Feed(chunk, events);
if (events.empty()) { s->tokens++; return; }
backend::Reply reply;
auto *delta = reply.add_chat_deltas();
bool any_field = false;
for (const auto &e : events) {
switch (e.type) {
case ds4cpp::ParserEvent::CONTENT:
delta->set_content(delta->content() + e.text);
any_field = true;
break;
case ds4cpp::ParserEvent::REASONING:
delta->set_reasoning_content(delta->reasoning_content() + e.text);
any_field = true;
break;
case ds4cpp::ParserEvent::TOOL_START: {
if ((int)s->tool_started.size() <= e.index)
s->tool_started.resize(e.index + 1, false);
s->tool_started[e.index] = true;
auto *tc = delta->add_tool_calls();
tc->set_index(e.index);
tc->set_id(e.tool_id);
tc->set_name(e.tool_name);
any_field = true;
break;
}
case ds4cpp::ParserEvent::TOOL_ARGS: {
auto *tc = delta->add_tool_calls();
tc->set_index(e.index);
tc->set_arguments(e.text);
any_field = true;
break;
}
case ds4cpp::ParserEvent::TOOL_END:
// No marker delta needed - the Go side closes the tool call on
// the final aggregator pass.
break;
}
}
reply.set_message(chunk);
reply.set_tokens(1);
if (any_field) {
if (!s->writer->Write(reply)) s->aborted = true;
}
s->tokens++;
}
static void stream_done(void *) {}
// Per-thread RNG seed for ds4_session_sample. Initialized lazily from
// system_clock; ds4 owns the random walk after that.
static uint64_t *get_rng() {
static thread_local uint64_t seed = 0;
if (seed == 0) {
seed = static_cast<uint64_t>(
std::chrono::system_clock::now().time_since_epoch().count());
if (seed == 0) seed = 1;
}
return &seed;
}
struct SampleParams {
float temperature;
int top_k;
float top_p;
float min_p;
};
// Compute the effective sampling parameters for the next token, mirroring
// ds4_server.c:7102-7115:
// - thinking mode enabled -> override (T=1, top_k=0, top_p=1, min_p=0)
// - inside DSML structural position (tool-call markers) -> force T=0
// - otherwise -> the request's user-supplied sampling settings
// The parser argument carries state from tokens emitted so far; its
// IsInDsmlStructural() predicts the next token's classification.
static SampleParams compute_sample_params(const backend::PredictOptions *request,
const ds4cpp::DsmlParser &parser,
bool think_enabled);
static ds4_think_mode parse_think_mode(const backend::PredictOptions *request) {
// Per the vllm backend convention, "enable_thinking" gates thinking on/off,
// and "reasoning_effort" picks the strength when on.
const auto &md = request->metadata();
auto et = md.find("enable_thinking");
bool enabled = true; // default ON per ds4-server
if (et != md.end()) enabled = (et->second == "true" || et->second == "1");
if (!enabled) return DS4_THINK_NONE;
auto re = md.find("reasoning_effort");
if (re != md.end() && (re->second == "max" || re->second == "xhigh"))
return DS4_THINK_MAX;
return DS4_THINK_HIGH;
}
static SampleParams compute_sample_params(const backend::PredictOptions *request,
const ds4cpp::DsmlParser &parser,
bool think_enabled) {
SampleParams p = {
request->temperature(),
request->topk(),
request->topp(),
request->minp(),
};
if (think_enabled) {
// Match ds4-server: thinking mode wants creativity in the reasoning
// pass and the trailing content, so the entire generation overrides
// sampling unless DSML structural bytes take over below.
p.temperature = 1.0f;
p.top_k = 0;
p.top_p = 1.0f;
p.min_p = 0.0f;
}
if (parser.IsInDsmlStructural()) {
// Tool-call structural bytes (tags, markers, headers) must parse
// cleanly. Force greedy regardless of user/thinking settings.
p.temperature = 0.0f;
}
return p;
}
// Build the rendered text for cache keying. We feed the same text the model
// will see; that lets the cache survive small client-side reformatting of
// chat history (the cache is keyed on bytes, not tokens).
static std::string render_prompt_text(const backend::PredictOptions *request) {
// Two-mode: either the raw prompt or the chat-template path. We mirror
// build_prompt's branching but accumulate text (not tokens) so we can
// SHA1 it for the cache key. ds4_session caches a tokens-indexed
// checkpoint, but the disk format keys on bytes per ds4-server's design.
if (!request->usetokenizertemplate() || request->messages_size() == 0) {
return request->prompt();
}
std::string out;
const std::string sys_role = "system";
for (const auto &m : request->messages()) {
if (m.role() == sys_role) { out += "[sys] " + m.content() + "\n"; break; }
}
for (const auto &m : request->messages()) {
if (m.role() == sys_role) continue;
out += "[" + m.role() + "] " + m.content() + "\n";
}
return out;
}
ds4cpp::KvCache g_kv_cache;
// Try to recover prefill state for `rendered`. Returns the matched prefix length.
static size_t maybe_load_cache(const std::string &rendered) {
if (!g_kv_cache.enabled() || !g_session) return 0;
return g_kv_cache.LoadLongestPrefix(g_session, rendered, g_ctx_size);
}
static void maybe_save_cache(const std::string &rendered) {
if (g_kv_cache.enabled() && g_session) {
g_kv_cache.Save(g_session, rendered, g_ctx_size);
}
}
static void build_prompt(ds4_engine *engine, const backend::PredictOptions *request,
ds4_tokens *out) {
if (!request->usetokenizertemplate() || request->messages_size() == 0) {
ds4_tokenize_text(engine, request->prompt().c_str(), out);
return;
}
// Chat-template path: render via ds4's helpers.
ds4_chat_begin(engine, out);
ds4_think_mode think = parse_think_mode(request);
// ds4_encode_chat_prompt is convenient when there is exactly one
// system+user pair, but for arbitrary turn lists we use the granular
// append helpers. Pull the first system message (if any), then append
// every other message in order.
const std::string sys_role = "system";
std::string system_text;
for (const auto &m : request->messages()) {
if (m.role() == sys_role) { system_text = m.content(); break; }
}
// Inject the tools manifest into the system prompt when tools are present.
// ds4 was trained to emit DSML tool calls ONLY when this preamble is in
// the system message - without it, the model has no idea tools exist and
// the e2e tool-call test will fail. The renderer lives in dsml_renderer
// and is a verbatim port of ds4_server.c's append_tools_prompt_text.
std::string tools_manifest;
if (!request->tools().empty()) {
tools_manifest = ds4cpp::RenderToolsManifest(request->tools());
}
if (!system_text.empty() || !tools_manifest.empty()) {
std::string combined = system_text;
if (!tools_manifest.empty()) {
if (!combined.empty()) combined += "\n\n";
combined += tools_manifest;
}
ds4_chat_append_message(engine, out, "system", combined.c_str());
}
for (const auto &m : request->messages()) {
if (m.role() == sys_role) continue;
if (m.role() == "assistant" && !m.tool_calls().empty()) {
std::string combined = m.content();
combined += ds4cpp::RenderAssistantToolCalls(m.tool_calls());
ds4_chat_append_message(engine, out, "assistant", combined.c_str());
} else if (m.role() == "tool") {
std::string body = ds4cpp::RenderToolResult(m.tool_call_id(), m.content());
ds4_chat_append_message(engine, out, "user", body.c_str());
} else {
ds4_chat_append_message(engine, out, m.role().c_str(), m.content().c_str());
}
}
ds4_chat_append_assistant_prefix(engine, out, think);
}
class DS4Backend final : public backend::Backend::Service {
public:
GStatus Health(ServerContext *, const backend::HealthMessage *,
backend::Reply *reply) override {
reply->set_message(std::string("OK"));
return GStatus::OK;
}
GStatus Free(ServerContext *, const backend::HealthMessage *,
backend::Result *result) override {
std::lock_guard<std::mutex> lock(g_engine_mu);
if (g_session) { ds4_session_free(g_session); g_session = nullptr; }
if (g_engine) { ds4_engine_close(g_engine); g_engine = nullptr; }
result->set_success(true);
return GStatus::OK;
}
GStatus LoadModel(ServerContext *, const backend::ModelOptions *request,
backend::Result *result) override {
std::lock_guard<std::mutex> lock(g_engine_mu);
if (g_engine) {
if (g_session) { ds4_session_free(g_session); g_session = nullptr; }
ds4_engine_close(g_engine);
g_engine = nullptr;
}
std::string model_path = request->modelfile();
if (model_path.empty()) model_path = request->model();
if (model_path.empty()) {
result->set_success(false);
result->set_message("ds4: ModelOptions.Model or .ModelFile must be set");
return GStatus::OK;
}
std::string mtp_path;
int mtp_draft = 0;
float mtp_margin = 3.0f;
for (const auto &opt : request->options()) {
auto [k, v] = split_option(opt);
if (k == "mtp_path") mtp_path = v;
else if (k == "mtp_draft") mtp_draft = std::stoi(v);
else if (k == "mtp_margin") mtp_margin = std::stof(v);
else if (k == "kv_cache_dir") g_kv_cache_dir = v;
}
g_kv_cache.SetDir(g_kv_cache_dir);
ds4_engine_options opt = {};
opt.model_path = model_path.c_str();
opt.mtp_path = mtp_path.empty() ? nullptr : mtp_path.c_str();
opt.n_threads = request->threads() > 0 ? request->threads() : 0;
opt.mtp_draft_tokens = mtp_draft;
opt.mtp_margin = mtp_margin;
opt.directional_steering_file = nullptr;
opt.warm_weights = false;
opt.quality = false;
#if defined(DS4_NO_GPU)
opt.backend = DS4_BACKEND_CPU;
#elif defined(__APPLE__)
opt.backend = DS4_BACKEND_METAL;
#else
opt.backend = DS4_BACKEND_CUDA;
#endif
int rc = ds4_engine_open(&g_engine, &opt);
if (rc != 0 || !g_engine) {
result->set_success(false);
result->set_message("ds4_engine_open failed (rc=" + std::to_string(rc) + ")");
return GStatus::OK;
}
g_ctx_size = request->contextsize() > 0 ? request->contextsize() : 32768;
rc = ds4_session_create(&g_session, g_engine, g_ctx_size);
if (rc != 0 || !g_session) {
ds4_engine_close(g_engine);
g_engine = nullptr;
result->set_success(false);
result->set_message("ds4_session_create failed (rc=" + std::to_string(rc) + ")");
return GStatus::OK;
}
result->set_success(true);
result->set_message("loaded " + model_path);
return GStatus::OK;
}
GStatus TokenizeString(ServerContext *, const backend::PredictOptions *request,
backend::TokenizationResponse *response) override {
std::lock_guard<std::mutex> lock(g_engine_mu);
if (!g_engine) return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
ds4_tokens out = {};
ds4_tokenize_text(g_engine, request->prompt().c_str(), &out);
for (int i = 0; i < out.len; ++i) response->add_tokens(out.v[i]);
response->set_length(out.len);
ds4_tokens_free(&out);
return GStatus::OK;
}
GStatus Predict(ServerContext *, const backend::PredictOptions *request,
backend::Reply *reply) override {
std::lock_guard<std::mutex> lock(g_engine_mu);
if (!g_engine || !g_session) {
return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
}
ds4_tokens prompt = {};
build_prompt(g_engine, request, &prompt);
int n_predict = request->tokens() > 0 ? request->tokens() : 256;
CollectCtx collect = {g_engine, "", {}, reply, 0, {}, "", ""};
std::string cache_key = render_prompt_text(request);
size_t cache_hit = maybe_load_cache(cache_key);
(void)cache_hit; // future: skip prompt prefix if hit covers full prompt
// Manual generation loop on g_session. When MTP speculative weights
// were loaded (LoadModel option 'mtp_path:'), we use the
// ds4_session_eval_speculative_argmax path which may accept N>1
// tokens per outer iteration. Otherwise per-token argmax + eval.
// Either way g_session advances so the disk KV cache picks up a
// real checkpoint after the call (see maybe_save_cache below).
char err[256] = {0};
int rc = ds4_session_sync(g_session, &prompt, err, sizeof(err));
int prompt_len = prompt.len;
ds4_tokens_free(&prompt);
if (rc == 0) {
const int eos = ds4_token_eos(g_engine);
const int draft_max = ds4_engine_mtp_draft_tokens(g_engine);
const bool think_enabled = ds4_think_mode_enabled(parse_think_mode(request));
int produced = 0;
while (produced < n_predict) {
SampleParams sp = compute_sample_params(request, collect.parser, think_enabled);
int first;
if (sp.temperature <= 0.0f) {
first = ds4_session_argmax(g_session);
} else {
first = ds4_session_sample(g_session,
sp.temperature, sp.top_k,
sp.top_p, sp.min_p, get_rng());
}
if (first == eos) break;
// MTP only when sampling is greedy (ds4-server gate).
if (draft_max > 0 && sp.temperature <= 0.0f) {
constexpr int kAcceptedMax = 8;
int accepted[kAcceptedMax];
int cap = std::min(kAcceptedMax, draft_max + 1);
int n = ds4_session_eval_speculative_argmax(
g_session, first, draft_max, eos,
accepted, cap, err, sizeof(err));
if (n < 0) { rc = -1; break; }
bool stop = false;
for (int j = 0; j < n; ++j) {
if (accepted[j] == eos) { stop = true; break; }
collect_emit(&collect, accepted[j]);
if (++produced >= n_predict) { stop = true; break; }
}
if (stop) break;
} else {
collect_emit(&collect, first);
if (++produced >= n_predict) break;
rc = ds4_session_eval(g_session, first, err, sizeof(err));
if (rc != 0) break;
}
}
collect_done(&collect);
}
maybe_save_cache(cache_key);
// Flush any buffered parser state.
std::vector<ds4cpp::ParserEvent> events;
collect.parser.Flush(events);
apply_events(&collect, events);
if (rc != 0) {
return GStatus(StatusCode::INTERNAL,
std::string("ds4 generation failed: ") + err);
}
// Emit one ChatDelta with content/reasoning/tool_calls.
auto *delta = reply->add_chat_deltas();
delta->set_content(collect.content_buf);
delta->set_reasoning_content(collect.reasoning_buf);
for (size_t i = 0; i < collect.pending.size(); ++i) {
auto *tc = delta->add_tool_calls();
tc->set_index(static_cast<int32_t>(i));
tc->set_id(collect.pending[i].id);
tc->set_name(collect.pending[i].name);
tc->set_arguments(collect.pending[i].args);
}
reply->set_message(collect.raw_buf);
reply->set_tokens(collect.tokens);
reply->set_prompt_tokens(prompt_len);
return GStatus::OK;
}
GStatus PredictStream(ServerContext *, const backend::PredictOptions *request,
ServerWriter<backend::Reply> *writer) override {
std::lock_guard<std::mutex> lock(g_engine_mu);
if (!g_engine || !g_session) {
return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
}
ds4_tokens prompt = {};
build_prompt(g_engine, request, &prompt);
int n_predict = request->tokens() > 0 ? request->tokens() : 256;
StreamCtx s = {g_engine, writer, {}, 0, false, {}};
std::string cache_key = render_prompt_text(request);
size_t cache_hit = maybe_load_cache(cache_key);
(void)cache_hit;
// Manual loop on g_session - see Predict() above for the rationale.
// MTP speculative path used when ds4_engine_mtp_draft_tokens > 0.
char err[256] = {0};
int rc = ds4_session_sync(g_session, &prompt, err, sizeof(err));
ds4_tokens_free(&prompt);
if (rc == 0) {
const int eos = ds4_token_eos(g_engine);
const int draft_max = ds4_engine_mtp_draft_tokens(g_engine);
const bool think_enabled = ds4_think_mode_enabled(parse_think_mode(request));
int produced = 0;
while (produced < n_predict && !s.aborted) {
SampleParams sp = compute_sample_params(request, s.parser, think_enabled);
int first;
if (sp.temperature <= 0.0f) {
first = ds4_session_argmax(g_session);
} else {
first = ds4_session_sample(g_session,
sp.temperature, sp.top_k,
sp.top_p, sp.min_p, get_rng());
}
if (first == eos) break;
if (draft_max > 0 && sp.temperature <= 0.0f) {
constexpr int kAcceptedMax = 8;
int accepted[kAcceptedMax];
int cap = std::min(kAcceptedMax, draft_max + 1);
int n = ds4_session_eval_speculative_argmax(
g_session, first, draft_max, eos,
accepted, cap, err, sizeof(err));
if (n < 0) { rc = -1; break; }
bool stop = false;
for (int j = 0; j < n; ++j) {
if (accepted[j] == eos) { stop = true; break; }
stream_emit(&s, accepted[j]);
if (s.aborted) { stop = true; break; }
if (++produced >= n_predict) { stop = true; break; }
}
if (stop) break;
} else {
stream_emit(&s, first);
if (s.aborted || ++produced >= n_predict) break;
rc = ds4_session_eval(g_session, first, err, sizeof(err));
if (rc != 0) break;
}
}
stream_done(&s);
}
maybe_save_cache(cache_key);
// Flush parser state.
std::vector<ds4cpp::ParserEvent> events;
s.parser.Flush(events);
if (!events.empty() && !s.aborted) {
backend::Reply reply;
auto *delta = reply.add_chat_deltas();
for (const auto &e : events) {
if (e.type == ds4cpp::ParserEvent::CONTENT) {
delta->set_content(delta->content() + e.text);
} else if (e.type == ds4cpp::ParserEvent::REASONING) {
delta->set_reasoning_content(delta->reasoning_content() + e.text);
}
}
s.writer->Write(reply);
}
if (rc != 0 && !s.aborted) {
return GStatus(StatusCode::INTERNAL,
std::string("ds4 generation failed: ") + err);
}
return GStatus::OK;
}
GStatus Status(ServerContext *, const backend::HealthMessage *,
backend::StatusResponse *response) override {
std::lock_guard<std::mutex> lock(g_engine_mu);
response->set_state(g_engine ? backend::StatusResponse::READY
: backend::StatusResponse::UNINITIALIZED);
return GStatus::OK;
}
};
void RunServer(const std::string &addr) {
DS4Backend service;
grpc::EnableDefaultHealthCheckService(true);
grpc::reflection::InitProtoReflectionServerBuilderPlugin();
ServerBuilder builder;
builder.AddListeningPort(addr, grpc::InsecureServerCredentials());
builder.RegisterService(&service);
builder.SetMaxReceiveMessageSize(64 * 1024 * 1024);
builder.SetMaxSendMessageSize(64 * 1024 * 1024);
std::unique_ptr<Server> server(builder.BuildAndStart());
if (!server) {
std::cerr << "ds4 grpc-server: failed to bind " << addr << "\n";
std::exit(1);
}
g_server = server.get();
std::cerr << "ds4 grpc-server listening on " << addr << "\n";
server->Wait();
}
void signal_handler(int) {
if (auto *srv = g_server.load()) {
srv->Shutdown(std::chrono::system_clock::now() +
std::chrono::seconds(3));
}
}
} // namespace
int main(int argc, char *argv[]) {
std::string addr = "127.0.0.1:50051";
for (int i = 1; i < argc; ++i) {
std::string a = argv[i];
const std::string addr_flag = "--addr=";
if (a.rfind(addr_flag, 0) == 0) addr = a.substr(addr_flag.size());
else if (a == "--addr" && i + 1 < argc) addr = argv[++i];
else if (a == "--help" || a == "-h") {
std::cout << "Usage: grpc-server --addr=HOST:PORT\n";
return 0;
}
}
std::signal(SIGINT, signal_handler);
std::signal(SIGTERM, signal_handler);
RunServer(addr);
return 0;
}

View File

@@ -0,0 +1,205 @@
#include "kv_cache.h"
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <dirent.h>
#include <fstream>
#include <sys/stat.h>
#include <vector>
namespace ds4cpp {
namespace {
// Minimal SHA1 (public domain reference). 30 lines; used only here.
struct Sha1 {
uint32_t h[5];
uint64_t bits;
uint8_t block[64];
size_t used;
Sha1() { h[0]=0x67452301; h[1]=0xEFCDAB89; h[2]=0x98BADCFE; h[3]=0x10325476; h[4]=0xC3D2E1F0; bits=0; used=0; }
static uint32_t rol(uint32_t x, int n){ return (x<<n)|(x>>(32-n)); }
void transform(const uint8_t *b) {
uint32_t w[80];
for (int i=0;i<16;i++) w[i] = (uint32_t)b[i*4]<<24 | (uint32_t)b[i*4+1]<<16 | (uint32_t)b[i*4+2]<<8 | b[i*4+3];
for (int i=16;i<80;i++) w[i] = rol(w[i-3]^w[i-8]^w[i-14]^w[i-16], 1);
uint32_t a=h[0],bb=h[1],c=h[2],d=h[3],e=h[4];
for (int i=0;i<80;i++) {
uint32_t f,k;
if (i<20) { f=(bb&c)|((~bb)&d); k=0x5A827999; }
else if (i<40) { f=bb^c^d; k=0x6ED9EBA1; }
else if (i<60) { f=(bb&c)|(bb&d)|(c&d); k=0x8F1BBCDC; }
else { f=bb^c^d; k=0xCA62C1D6; }
uint32_t t = rol(a,5)+f+e+k+w[i];
e=d; d=c; c=rol(bb,30); bb=a; a=t;
}
h[0]+=a; h[1]+=bb; h[2]+=c; h[3]+=d; h[4]+=e;
}
void update(const void *p, size_t n) {
const uint8_t *bp = (const uint8_t*)p;
bits += (uint64_t)n*8;
while (n) {
size_t take = 64-used;
if (take>n) take=n;
std::memcpy(block+used, bp, take);
used += take; bp += take; n -= take;
if (used == 64) { transform(block); used = 0; }
}
}
void final(uint8_t out[20]) {
uint8_t pad[64] = {0x80};
size_t padlen = (used < 56) ? (56-used) : (120-used);
uint64_t lb = bits;
uint8_t len[8];
for (int i=0;i<8;i++) len[7-i] = (uint8_t)(lb >> (i*8));
update(pad, padlen);
update(len, 8);
for (int i=0;i<5;i++) {
out[i*4] = h[i]>>24;
out[i*4+1] = h[i]>>16;
out[i*4+2] = h[i]>>8;
out[i*4+3] = h[i];
}
}
};
std::string mkdir_p(const std::string &d) {
if (d.empty()) return d;
struct stat st{};
if (stat(d.c_str(), &st) == 0) return d;
mkdir(d.c_str(), 0755);
return d;
}
bool file_exists(const std::string &p) {
struct stat st{};
return stat(p.c_str(), &st) == 0;
}
} // namespace
std::string Sha1Hex(const void *data, size_t len) {
Sha1 s;
s.update(data, len);
uint8_t out[20];
s.final(out);
char hex[41];
for (int i = 0; i < 20; ++i) std::snprintf(hex + i*2, 3, "%02x", out[i]);
hex[40] = 0;
return std::string(hex);
}
KvCache::KvCache() = default;
void KvCache::SetDir(const std::string &dir) {
dir_ = dir;
if (!dir_.empty()) {
mkdir_p(dir_);
std::fprintf(stderr, "ds4 KvCache: enabled at %s\n", dir_.c_str());
} else {
std::fprintf(stderr, "ds4 KvCache: disabled (no dir set)\n");
}
}
std::string KvCache::Path(const std::string &rendered_text) const {
if (dir_.empty()) return "";
return dir_ + "/" + Sha1Hex(rendered_text.data(), rendered_text.size()) + ".kv";
}
size_t KvCache::LoadLongestPrefix(ds4_session *session,
const std::string &rendered_text,
int ctx_size) {
if (dir_.empty() || !session) return 0;
// Strategy: enumerate all .kv files in dir, read their stored prefix
// header, pick the longest one that is also a prefix of rendered_text.
DIR *d = opendir(dir_.c_str());
if (!d) return 0;
struct dirent *de;
size_t best_len = 0;
std::string best_path;
while ((de = readdir(d)) != nullptr) {
std::string name = de->d_name;
if (name.size() < 4 || name.substr(name.size()-3) != ".kv") continue;
std::string path = dir_ + "/" + name;
std::ifstream f(path, std::ios::binary);
if (!f) continue;
char magic[4]; f.read(magic, 4);
if (f.gcount() != 4 || std::memcmp(magic, "DS4G", 4) != 0) continue;
uint32_t version=0, file_ctx=0, prefix_len=0;
f.read((char*)&version, 4); f.read((char*)&file_ctx, 4); f.read((char*)&prefix_len, 4);
if (version != 1) continue;
if ((int)file_ctx != ctx_size) continue;
if (prefix_len > rendered_text.size()) continue;
std::vector<char> prefix(prefix_len);
f.read(prefix.data(), prefix_len);
if (std::memcmp(prefix.data(), rendered_text.data(), prefix_len) != 0) continue;
if (prefix_len > best_len) {
best_len = prefix_len;
best_path = path;
}
}
closedir(d);
if (best_len == 0) return 0;
// Load best_path's payload into session.
std::ifstream f(best_path, std::ios::binary);
char magic[4]; f.read(magic, 4);
uint32_t version, file_ctx, prefix_len;
f.read((char*)&version, 4); f.read((char*)&file_ctx, 4); f.read((char*)&prefix_len, 4);
f.seekg(prefix_len, std::ios::cur);
uint64_t payload_bytes = 0;
f.read((char*)&payload_bytes, 8);
// ds4_session_load_payload reads from a FILE*; reopen via fopen.
FILE *fp = std::fopen(best_path.c_str(), "rb");
if (!fp) return 0;
// Seek past header + prefix + payload_bytes field.
std::fseek(fp, 4 + 4 + 4 + 4 + prefix_len + 8, SEEK_SET);
char errbuf[256] = {0};
int rc = ds4_session_load_payload(session, fp, payload_bytes, errbuf, sizeof(errbuf));
std::fclose(fp);
if (rc != 0) return 0;
return best_len;
}
void KvCache::Save(ds4_session *session, const std::string &rendered_text, int ctx_size) {
if (dir_.empty()) {
std::fprintf(stderr, "ds4 KvCache::Save: skipped (dir empty)\n");
return;
}
if (!session) {
std::fprintf(stderr, "ds4 KvCache::Save: skipped (session null)\n");
return;
}
std::string path = Path(rendered_text);
uint64_t payload_bytes = ds4_session_payload_bytes(session);
std::fprintf(stderr, "ds4 KvCache::Save: path=%s payload_bytes=%llu prefix_len=%zu\n",
path.c_str(), (unsigned long long)payload_bytes, rendered_text.size());
FILE *fp = std::fopen(path.c_str(), "wb");
if (!fp) {
std::fprintf(stderr, "ds4 KvCache::Save: fopen failed: %s\n", std::strerror(errno));
return;
}
char magic[4] = {'D','S','4','G'};
uint32_t version = 1;
uint32_t ctx = static_cast<uint32_t>(ctx_size);
uint32_t prefix_len = static_cast<uint32_t>(rendered_text.size());
std::fwrite(magic, 4, 1, fp);
std::fwrite(&version, 4, 1, fp);
std::fwrite(&ctx, 4, 1, fp);
std::fwrite(&prefix_len, 4, 1, fp);
std::fwrite(rendered_text.data(), prefix_len, 1, fp);
std::fwrite(&payload_bytes, 8, 1, fp);
char errbuf[256] = {0};
int rc = ds4_session_save_payload(session, fp, errbuf, sizeof(errbuf));
std::fclose(fp);
if (rc != 0) {
std::fprintf(stderr, "ds4 KvCache::Save: ds4_session_save_payload rc=%d err=%s; removing %s\n",
rc, errbuf, path.c_str());
std::remove(path.c_str());
} else {
std::fprintf(stderr, "ds4 KvCache::Save: wrote %s ok\n", path.c_str());
}
}
} // namespace ds4cpp

View File

@@ -0,0 +1,44 @@
#pragma once
#include <string>
extern "C" {
#include "ds4.h"
}
namespace ds4cpp {
// Disk-backed KV cache for ds4 sessions. Keyed by SHA1(rendered prompt prefix).
// Format (our own, NOT bit-compatible with ds4-server's KVC files - interop
// is a follow-up plan):
//
// "DS4G" (4 bytes magic) + u32 version=1 + u32 ctx_size +
// u32 prefix_text_len + prefix_text + u64 payload_bytes + payload
class KvCache {
public:
KvCache(); // disabled (dir empty)
// Set the cache directory. Empty disables.
void SetDir(const std::string &dir);
// Returns the cache file path for a given rendered text prefix.
std::string Path(const std::string &rendered_text) const;
// Look up the longest cached prefix that is also a prefix of
// `rendered_text`. Loads it into `session` if found. Returns the
// matched prefix length in bytes (0 if no hit).
size_t LoadLongestPrefix(ds4_session *session,
const std::string &rendered_text,
int ctx_size);
// Save the current session, associated with this rendered text prefix.
void Save(ds4_session *session, const std::string &rendered_text, int ctx_size);
bool enabled() const { return !dir_.empty(); }
private:
std::string dir_;
};
// Compute SHA1 of arbitrary bytes; returns 40-char hex.
std::string Sha1Hex(const void *data, size_t len);
} // namespace ds4cpp

39
backend/cpp/ds4/package.sh Executable file
View File

@@ -0,0 +1,39 @@
#!/bin/bash
set -e
CURDIR=$(dirname "$(realpath "$0")")
REPO_ROOT="${CURDIR}/../../.."
mkdir -p "$CURDIR/package/lib"
cp -avf "$CURDIR/grpc-server" "$CURDIR/package/"
cp -rfv "$CURDIR/run.sh" "$CURDIR/package/"
UNAME_S=$(uname -s)
if [ "$UNAME_S" = "Darwin" ]; then
# Darwin: bundle dylibs via otool -L (handled by scripts/build/ds4-darwin.sh).
echo "package.sh: Darwin handled by ds4-darwin.sh"
exit 0
fi
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
cp -arfLv /lib64/ld-linux-x86-64.so.2 "$CURDIR/package/lib/ld.so"
LIBDIR=/lib/x86_64-linux-gnu
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
cp -arfLv /lib/ld-linux-aarch64.so.1 "$CURDIR/package/lib/ld.so"
LIBDIR=/lib/aarch64-linux-gnu
else
echo "package.sh: unknown architecture" >&2; exit 1
fi
for lib in libc.so.6 libgcc_s.so.1 libstdc++.so.6 libm.so.6 libgomp.so.1 \
libdl.so.2 librt.so.1 libpthread.so.0; do
cp -arfLv "$LIBDIR/$lib" "$CURDIR/package/lib/$lib"
done
GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
if [ -f "$GPU_LIB_SCRIPT" ]; then
source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
package_gpu_libs
fi
echo "ds4 package contents:"
ls -lah "$CURDIR/package/" "$CURDIR/package/lib/"

9
backend/cpp/ds4/run.sh Executable file
View File

@@ -0,0 +1,9 @@
#!/bin/bash
# Entry point for the ds4 backend image / BACKEND_BINARY mode.
set -e
CURDIR=$(dirname "$(realpath "$0")")
export LD_LIBRARY_PATH="$CURDIR/lib:$LD_LIBRARY_PATH"
if [ -f "$CURDIR/lib/ld.so" ]; then
exec "$CURDIR/lib/ld.so" "$CURDIR/grpc-server" "$@"
fi
exec "$CURDIR/grpc-server" "$@"

View File

@@ -1,5 +1,5 @@
IK_LLAMA_VERSION?=d4824131580b94ffa7b0e91c955e2b237c2fe16e
IK_LLAMA_VERSION?=c35189d83c91aad780aba62b89f2830cb2916223
LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp
CMAKE_ARGS?=

View File

@@ -326,7 +326,7 @@ struct llama_client_slot
char buffer[512];
double t_token = t_prompt_processing / num_prompt_tokens_processed;
double n_tokens_second = 1e3 / t_prompt_processing * num_prompt_tokens_processed;
sprintf(buffer, "prompt eval time = %10.2f ms / %5d tokens (%8.2f ms per token, %8.2f tokens per second)",
snprintf(buffer, sizeof(buffer), "prompt eval time = %10.2f ms / %5d tokens (%8.2f ms per token, %8.2f tokens per second)",
t_prompt_processing, num_prompt_tokens_processed,
t_token, n_tokens_second);
LOG_INFO(buffer, {
@@ -340,7 +340,7 @@ struct llama_client_slot
t_token = t_token_generation / n_decoded;
n_tokens_second = 1e3 / t_token_generation * n_decoded;
sprintf(buffer, "generation eval time = %10.2f ms / %5d runs (%8.2f ms per token, %8.2f tokens per second)",
snprintf(buffer, sizeof(buffer), "generation eval time = %10.2f ms / %5d runs (%8.2f ms per token, %8.2f tokens per second)",
t_token_generation, n_decoded,
t_token, n_tokens_second);
LOG_INFO(buffer, {
@@ -352,7 +352,7 @@ struct llama_client_slot
{"n_tokens_second", n_tokens_second},
});
sprintf(buffer, " total time = %10.2f ms", t_prompt_processing + t_token_generation);
snprintf(buffer, sizeof(buffer), " total time = %10.2f ms", t_prompt_processing + t_token_generation);
LOG_INFO(buffer, {
{"slot_id", id},
{"task_id", task_id},
@@ -686,7 +686,16 @@ struct llama_server_context
slot->sparams.mirostat_eta = json_value(data, "mirostat_eta", default_sparams.mirostat_eta);
slot->params.n_keep = json_value(data, "n_keep", slot->params.n_keep);
slot->sparams.seed = json_value(data, "seed", default_sparams.seed);
slot->sparams.grammar = json_value(data, "grammar", default_sparams.grammar);
{
// upstream changed common_params_sampling::grammar from std::string to
// the common_grammar struct (type + grammar). The incoming JSON still
// carries a plain string, so build the user-provided grammar here and
// fall back to the server default when the request omits it.
std::string grammar_str = json_value(data, "grammar", std::string());
slot->sparams.grammar = grammar_str.empty()
? default_sparams.grammar
: common_grammar{COMMON_GRAMMAR_TYPE_USER, std::move(grammar_str)};
}
slot->sparams.n_probs = json_value(data, "n_probs", default_sparams.n_probs);
slot->sparams.min_keep = json_value(data, "min_keep", default_sparams.min_keep);
slot->sparams.grammar_triggers = grammar_triggers;
@@ -1232,7 +1241,7 @@ struct llama_server_context
// {"logit_bias", slot.sparams.logit_bias},
{"n_probs", slot.sparams.n_probs},
{"min_keep", slot.sparams.min_keep},
{"grammar", slot.sparams.grammar},
{"grammar", slot.sparams.grammar.grammar},
{"samplers", samplers}
};
}

View File

@@ -0,0 +1,11 @@
--- a/examples/llava/clip.cpp
+++ b/examples/llava/clip.cpp
@@ -2494,7 +2494,7 @@
}
new_data = work.data();
- new_size = ggml_quantize_chunk(new_type, f32_data, new_data, 0, n_elms/cur->ne[0], cur->ne[0], nullptr);
+ new_size = ggml_quantize_chunk(new_type, f32_data, new_data, 0, n_elms/cur->ne[0], cur->ne[0], nullptr, nullptr);
} else {
new_type = cur->type;
new_data = cur->data;

View File

@@ -1,5 +1,5 @@
LLAMA_VERSION?=5a4cd6741fc33227cdacb329f355ab21f8481de2
LLAMA_VERSION?=87589042cac2c390cec8d68fb2fad64e0a2a252a
LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
CMAKE_ARGS?=
@@ -34,6 +34,9 @@ else ifeq ($(BUILD_TYPE),hipblas)
export CXX=$(ROCM_HOME)/llvm/bin/clang++
export CC=$(ROCM_HOME)/llvm/bin/clang
AMDGPU_TARGETS?=gfx908,gfx90a,gfx942,gfx950,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201
ifeq ($(strip $(AMDGPU_TARGETS)),)
$(error AMDGPU_TARGETS is emptyset it to a comma-separated list of gfx targets e.g. gfx1100,gfx1101)
endif
CMAKE_ARGS+=-DGGML_HIP=ON -DAMDGPU_TARGETS=$(AMDGPU_TARGETS)
else ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DGGML_VULKAN=1

View File

@@ -10,6 +10,14 @@
#include "server-task.cpp"
#include "server-queue.cpp"
#include "server-common.cpp"
// server-chat.cpp exists only in llama.cpp after the upstream refactor that
// split OAI/Anthropic/Responses/transcription conversion helpers out of
// server-common.cpp. When present, server-context.cpp and server-task.cpp
// above call into it, so we must pull its definitions into this TU or the
// link fails. __has_include keeps the source compatible with older pins.
#if __has_include("server-chat.cpp")
#include "server-chat.cpp"
#endif
#include "server-context.cpp"
// LocalAI
@@ -24,10 +32,13 @@
#include <grpcpp/health_check_service_interface.h>
#include <grpcpp/security/server_credentials.h>
#include <regex>
#include <algorithm>
#include <atomic>
#include <cstdlib>
#include <fstream>
#include <iterator>
#include <list>
#include <map>
#include <mutex>
#include <signal.h>
#include <thread>
@@ -434,11 +445,25 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
// Draft model for speculative decoding
if (!request->draftmodel().empty()) {
params.speculative.mparams_dft.path = request->draftmodel();
// Default to draft type if a draft model is set but no explicit type
params.speculative.draft.mparams.path = request->draftmodel();
// Default to draft type if a draft model is set but no explicit type.
// Upstream (post ggml-org/llama.cpp#22838) made the speculative type a
// vector; the turboquant fork still uses the legacy scalar. The
// LOCALAI_LEGACY_LLAMA_CPP_SPEC macro is injected by
// backend/cpp/turboquant/patch-grpc-server.sh for fork builds only.
// Upstream renamed COMMON_SPECULATIVE_TYPE_DRAFT -> ..._DRAFT_SIMPLE
// in ggml-org/llama.cpp#22964; the fork still uses the old name.
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
if (params.speculative.type == COMMON_SPECULATIVE_TYPE_NONE) {
params.speculative.type = COMMON_SPECULATIVE_TYPE_DRAFT;
}
#else
const bool no_spec_type = params.speculative.types.empty() ||
(params.speculative.types.size() == 1 && params.speculative.types[0] == COMMON_SPECULATIVE_TYPE_NONE);
if (no_spec_type) {
params.speculative.types = { COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE };
}
#endif
}
// params.model_alias ??
@@ -634,6 +659,21 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
} else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
params.no_op_offload = false;
}
} else if (!strcmp(optname, "split_mode") || !strcmp(optname, "sm")) {
// Accepts: none | layer | row | tensor (the latter requires a llama.cpp build
// that includes ggml-org/llama.cpp#19378, FlashAttention enabled, and KV-cache
// quantization disabled).
if (optval != NULL) {
if (optval_str == "none") {
params.split_mode = LLAMA_SPLIT_MODE_NONE;
} else if (optval_str == "layer") {
params.split_mode = LLAMA_SPLIT_MODE_LAYER;
} else if (optval_str == "row") {
params.split_mode = LLAMA_SPLIT_MODE_ROW;
} else if (optval_str == "tensor") {
params.split_mode = LLAMA_SPLIT_MODE_TENSOR;
}
}
} else if (!strcmp(optname, "kv_unified") || !strcmp(optname, "unified_kv")) {
if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
params.kv_unified = true;
@@ -648,49 +688,360 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
// If conversion fails, keep default value (8)
}
}
// --- physical batch size (upstream -ub / --ubatch-size) ---
// Note: line ~482 already aliases n_ubatch to n_batch as a default; this
// option lets users decouple the two (useful for embeddings/rerank).
} else if (!strcmp(optname, "n_ubatch") || !strcmp(optname, "ubatch")) {
if (optval != NULL) {
try { params.n_ubatch = std::stoi(optval_str); } catch (...) {}
}
// --- main-model batch threads (upstream -tb / --threads-batch) ---
} else if (!strcmp(optname, "threads_batch") || !strcmp(optname, "n_threads_batch")) {
if (optval != NULL) {
try {
int n = std::stoi(optval_str);
if (n <= 0) n = (int)std::thread::hardware_concurrency();
params.cpuparams_batch.n_threads = n;
} catch (...) {}
}
// --- pooling type for embeddings (upstream --pooling) ---
} else if (!strcmp(optname, "pooling_type") || !strcmp(optname, "pooling")) {
if (optval != NULL) {
if (optval_str == "none") params.pooling_type = LLAMA_POOLING_TYPE_NONE;
else if (optval_str == "mean") params.pooling_type = LLAMA_POOLING_TYPE_MEAN;
else if (optval_str == "cls") params.pooling_type = LLAMA_POOLING_TYPE_CLS;
else if (optval_str == "last") params.pooling_type = LLAMA_POOLING_TYPE_LAST;
else if (optval_str == "rank") params.pooling_type = LLAMA_POOLING_TYPE_RANK;
// unknown values silently leave UNSPECIFIED (auto-detect)
}
// --- llama log verbosity threshold (upstream -lv / --verbosity) ---
} else if (!strcmp(optname, "verbosity")) {
if (optval != NULL) {
try { params.verbosity = std::stoi(optval_str); } catch (...) {}
}
// --- O_DIRECT model loading (upstream --direct-io) ---
} else if (!strcmp(optname, "direct_io") || !strcmp(optname, "use_direct_io")) {
if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
params.use_direct_io = true;
} else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
params.use_direct_io = false;
}
// --- embedding normalization (upstream --embd-normalize) ---
// -1 none, 0 max-abs, 1 taxicab, 2 L2 (default), >2 p-norm
} else if (!strcmp(optname, "embd_normalize") || !strcmp(optname, "embedding_normalize")) {
if (optval != NULL) {
try { params.embd_normalize = std::stoi(optval_str); } catch (...) {}
}
// --- reasoning parser (upstream --reasoning-format) ---
// Picks the parser for <think> blocks emitted by reasoning models.
// none / auto / deepseek / deepseek-legacy
} else if (!strcmp(optname, "reasoning_format")) {
if (optval != NULL) {
if (optval_str == "none") params.reasoning_format = COMMON_REASONING_FORMAT_NONE;
else if (optval_str == "auto") params.reasoning_format = COMMON_REASONING_FORMAT_AUTO;
else if (optval_str == "deepseek") params.reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
else if (optval_str == "deepseek-legacy" || optval_str == "deepseek_legacy")
params.reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK_LEGACY;
// unknown values silently keep the upstream default (DEEPSEEK)
}
// --- reasoning budget (upstream --reasoning-budget) ---
// -1 unlimited, 0 disabled, >0 token budget for thinking blocks.
// Distinct from per-request `enable_thinking` (chat_template_kwargs).
} else if (!strcmp(optname, "enable_reasoning") || !strcmp(optname, "reasoning_budget")) {
if (optval != NULL) {
try { params.enable_reasoning = std::stoi(optval_str); } catch (...) {}
}
// --- prefill assistant turn (upstream --no-prefill-assistant) ---
} else if (!strcmp(optname, "prefill_assistant")) {
if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
params.prefill_assistant = true;
} else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
params.prefill_assistant = false;
}
// --- mmproj GPU offload (upstream --no-mmproj-offload, inverted) ---
} else if (!strcmp(optname, "mmproj_use_gpu") || !strcmp(optname, "mmproj_offload")) {
if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
params.mmproj_use_gpu = true;
} else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
params.mmproj_use_gpu = false;
}
// --- per-image vision token budget (upstream --image-min/max-tokens) ---
} else if (!strcmp(optname, "image_min_tokens")) {
if (optval != NULL) {
try { params.image_min_tokens = std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "image_max_tokens")) {
if (optval != NULL) {
try { params.image_max_tokens = std::stoi(optval_str); } catch (...) {}
}
// --- main-model tensor buffer overrides (upstream --override-tensor) ---
// Format: <tensor regex>=<buffer type>,<tensor regex>=<buffer type>,...
// Mirrors the existing `draft_override_tensor` parser below.
} else if (!strcmp(optname, "override_tensor") || !strcmp(optname, "tensor_buft_overrides")) {
ggml_backend_load_all();
std::map<std::string, ggml_backend_buffer_type_t> buft_list;
for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
auto * dev = ggml_backend_dev_get(i);
auto * buft = ggml_backend_dev_buffer_type(dev);
if (buft) {
buft_list[ggml_backend_buft_name(buft)] = buft;
}
}
static std::list<std::string> override_names;
std::string cur;
auto flush = [&](const std::string & spec) {
auto pos = spec.find('=');
if (pos == std::string::npos) return;
const std::string name = spec.substr(0, pos);
const std::string type = spec.substr(pos + 1);
auto it = buft_list.find(type);
if (it == buft_list.end()) return; // unknown buffer type: ignore
override_names.push_back(name);
params.tensor_buft_overrides.push_back(
{override_names.back().c_str(), it->second});
};
for (char c : optval_str) {
if (c == ',') { if (!cur.empty()) { flush(cur); cur.clear(); } }
else { cur.push_back(c); }
}
if (!cur.empty()) flush(cur);
// Speculative decoding options
} else if (!strcmp(optname, "spec_type") || !strcmp(optname, "speculative_type")) {
auto type = common_speculative_type_from_name(optval_str);
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
// Fork only knows a single scalar `type`. Take the first comma-
// separated value and assign it via the singular helper.
std::string first = optval_str;
const auto comma = first.find(',');
if (comma != std::string::npos) first = first.substr(0, comma);
auto type = common_speculative_type_from_name(first);
if (type != COMMON_SPECULATIVE_TYPE_COUNT) {
params.speculative.type = type;
}
#else
// Upstream switched to a vector of types (comma-separated for multi-type
// chaining via common_speculative_types_from_names). We keep accepting a
// single value here, but also tolerate comma-separated lists.
//
// ggml-org/llama.cpp#22964 also renamed the registered names from
// underscore- to dash-separated form, and replaced the bare
// `draft`/`eagle3` aliases with `draft-simple`/`draft-eagle3`. We
// normalize each token here so existing model configs keep working.
auto normalize_spec_name = [](std::string s) -> std::string {
std::replace(s.begin(), s.end(), '_', '-');
if (s == "draft") return "draft-simple";
if (s == "eagle3") return "draft-eagle3";
return s;
};
std::vector<std::string> names;
std::string item;
for (char c : optval_str) {
if (c == ',') {
if (!item.empty()) { names.push_back(normalize_spec_name(item)); item.clear(); }
} else {
item.push_back(c);
}
}
if (!item.empty()) names.push_back(normalize_spec_name(item));
auto parsed = common_speculative_types_from_names(names);
if (!parsed.empty()) {
params.speculative.types = parsed;
}
#endif
} else if (!strcmp(optname, "spec_n_max") || !strcmp(optname, "draft_max")) {
if (optval != NULL) {
try { params.speculative.n_max = std::stoi(optval_str); } catch (...) {}
try { params.speculative.draft.n_max = std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_n_min") || !strcmp(optname, "draft_min")) {
if (optval != NULL) {
try { params.speculative.n_min = std::stoi(optval_str); } catch (...) {}
try { params.speculative.draft.n_min = std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_p_min") || !strcmp(optname, "draft_p_min")) {
if (optval != NULL) {
try { params.speculative.p_min = std::stof(optval_str); } catch (...) {}
try { params.speculative.draft.p_min = std::stof(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_p_split")) {
if (optval != NULL) {
try { params.speculative.p_split = std::stof(optval_str); } catch (...) {}
try { params.speculative.draft.p_split = std::stof(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_size_n") || !strcmp(optname, "ngram_size_n")) {
if (optval != NULL) {
try { params.speculative.ngram_size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
try { params.speculative.ngram_simple.size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_size_m") || !strcmp(optname, "ngram_size_m")) {
if (optval != NULL) {
try { params.speculative.ngram_size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
try { params.speculative.ngram_simple.size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_min_hits") || !strcmp(optname, "ngram_min_hits")) {
if (optval != NULL) {
try { params.speculative.ngram_min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
try { params.speculative.ngram_simple.min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "draft_gpu_layers")) {
if (optval != NULL) {
try { params.speculative.n_gpu_layers = std::stoi(optval_str); } catch (...) {}
try { params.speculative.draft.n_gpu_layers = std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "draft_ctx_size")) {
if (optval != NULL) {
try { params.speculative.n_ctx = std::stoi(optval_str); } catch (...) {}
}
// The draft context size is no longer a separate field upstream: the draft
// shares the target context size. Accept the option for backward
// compatibility but silently ignore it.
// Everything below relies on struct shape introduced in ggml-org/llama.cpp#22838
// (parallel drafting): `ngram_mod`, `ngram_map_k`, `ngram_map_k4v`,
// `ngram_cache`, and the `draft.{cache_type_*, cpuparams*, tensor_buft_overrides}`
// fields. The turboquant fork branched before that, so its build defines
// LOCALAI_LEGACY_LLAMA_CPP_SPEC via patch-grpc-server.sh and these option
// keys become unrecognized (silently dropped, like any unknown opt) for it.
//
// The `#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC` / `#else` split below sits at the
// closing-brace position of the `draft_ctx_size` branch on purpose: in the
// legacy build the chain ends here (the brace closes draft_ctx_size), and in
// the modern build the chain continues with `} else if (...)` instead, so the
// brace count stays balanced under both branches of the preprocessor.
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
}
#else
// --- ngram_mod family (upstream --spec-ngram-mod-*) ---
} else if (!strcmp(optname, "spec_ngram_mod_n_min")) {
if (optval != NULL) {
try { params.speculative.ngram_mod.n_min = std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_mod_n_max")) {
if (optval != NULL) {
try { params.speculative.ngram_mod.n_max = std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_mod_n_match")) {
if (optval != NULL) {
try { params.speculative.ngram_mod.n_match = std::stoi(optval_str); } catch (...) {}
}
// --- ngram_map_k family (upstream --spec-ngram-map-k-*) ---
} else if (!strcmp(optname, "spec_ngram_map_k_size_n")) {
if (optval != NULL) {
try { params.speculative.ngram_map_k.size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_map_k_size_m")) {
if (optval != NULL) {
try { params.speculative.ngram_map_k.size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_map_k_min_hits")) {
if (optval != NULL) {
try { params.speculative.ngram_map_k.min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
// --- ngram_map_k4v family (upstream --spec-ngram-map-k4v-*) ---
} else if (!strcmp(optname, "spec_ngram_map_k4v_size_n")) {
if (optval != NULL) {
try { params.speculative.ngram_map_k4v.size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_map_k4v_size_m")) {
if (optval != NULL) {
try { params.speculative.ngram_map_k4v.size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
} else if (!strcmp(optname, "spec_ngram_map_k4v_min_hits")) {
if (optval != NULL) {
try { params.speculative.ngram_map_k4v.min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
}
// --- ngram lookup caches (upstream --lookup-cache-static / -dynamic) ---
} else if (!strcmp(optname, "spec_lookup_cache_static") || !strcmp(optname, "lookup_cache_static")) {
params.speculative.ngram_cache.lookup_cache_static = optval_str;
} else if (!strcmp(optname, "spec_lookup_cache_dynamic") || !strcmp(optname, "lookup_cache_dynamic")) {
params.speculative.ngram_cache.lookup_cache_dynamic = optval_str;
// --- draft model KV cache types (upstream --spec-draft-type-k / -v) ---
} else if (!strcmp(optname, "draft_cache_type_k") || !strcmp(optname, "spec_draft_cache_type_k")) {
params.speculative.draft.cache_type_k = kv_cache_type_from_str(optval_str);
} else if (!strcmp(optname, "draft_cache_type_v") || !strcmp(optname, "spec_draft_cache_type_v")) {
params.speculative.draft.cache_type_v = kv_cache_type_from_str(optval_str);
// --- draft model thread counts (upstream --spec-draft-threads / -batch) ---
} else if (!strcmp(optname, "draft_threads") || !strcmp(optname, "spec_draft_threads")) {
if (optval != NULL) {
try {
int n = std::stoi(optval_str);
if (n <= 0) n = (int)std::thread::hardware_concurrency();
params.speculative.draft.cpuparams.n_threads = n;
} catch (...) {}
}
} else if (!strcmp(optname, "draft_threads_batch") || !strcmp(optname, "spec_draft_threads_batch")) {
if (optval != NULL) {
try {
int n = std::stoi(optval_str);
if (n <= 0) n = (int)std::thread::hardware_concurrency();
params.speculative.draft.cpuparams_batch.n_threads = n;
} catch (...) {}
}
// --- draft model MoE on CPU (upstream --spec-draft-cpu-moe / --spec-draft-n-cpu-moe) ---
} else if (!strcmp(optname, "draft_cpu_moe") || !strcmp(optname, "spec_draft_cpu_moe")) {
// Bool-style flag: optval may be missing, "true"/"1"/"yes" enables.
const bool enable = (optval == NULL) ||
optval_str == "true" || optval_str == "1" || optval_str == "yes" ||
optval_str == "on" || optval_str == "enabled";
if (enable) {
params.speculative.draft.tensor_buft_overrides.push_back(llm_ffn_exps_cpu_override());
}
} else if (!strcmp(optname, "draft_n_cpu_moe") || !strcmp(optname, "spec_draft_n_cpu_moe")) {
if (optval != NULL) {
try {
int n = std::stoi(optval_str);
if (n < 0) n = 0;
// Keep override-name storage alive for the lifetime of the params struct
// (mirrors upstream arg.cpp behavior with a function-local static).
static std::list<std::string> buft_overrides_draft;
for (int i = 0; i < n; ++i) {
buft_overrides_draft.push_back(llm_ffn_exps_block_regex(i));
params.speculative.draft.tensor_buft_overrides.push_back(
{buft_overrides_draft.back().c_str(), ggml_backend_cpu_buffer_type()});
}
} catch (...) {}
}
// --- draft model tensor buffer overrides (upstream --spec-draft-override-tensor) ---
} else if (!strcmp(optname, "draft_override_tensor") || !strcmp(optname, "spec_draft_override_tensor")) {
// Format: <tensor regex>=<buffer type>,<tensor regex>=<buffer type>,...
// We replicate upstream's parse_tensor_buffer_overrides (static in arg.cpp).
ggml_backend_load_all();
std::map<std::string, ggml_backend_buffer_type_t> buft_list;
for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
auto * dev = ggml_backend_dev_get(i);
auto * buft = ggml_backend_dev_buffer_type(dev);
if (buft) {
buft_list[ggml_backend_buft_name(buft)] = buft;
}
}
static std::list<std::string> draft_override_names;
std::string cur;
auto flush = [&](const std::string & spec) {
auto pos = spec.find('=');
if (pos == std::string::npos) return;
const std::string name = spec.substr(0, pos);
const std::string type = spec.substr(pos + 1);
auto it = buft_list.find(type);
if (it == buft_list.end()) return; // unknown buffer type: ignore
draft_override_names.push_back(name);
params.speculative.draft.tensor_buft_overrides.push_back(
{draft_override_names.back().c_str(), it->second});
};
for (char c : optval_str) {
if (c == ',') { if (!cur.empty()) { flush(cur); cur.clear(); } }
else { cur.push_back(c); }
}
if (!cur.empty()) flush(cur);
}
#endif // LOCALAI_LEGACY_LLAMA_CPP_SPEC — closes the `else`/`#ifdef` opened at draft_ctx_size
}
// Set params.n_parallel from environment variable if not set via options (fallback)
@@ -910,8 +1261,8 @@ public:
if (!params.mmproj.path.empty()) {
error_msg += " (with mmproj: " + params.mmproj.path + ")";
}
if (params.speculative.has_dft() && !params.speculative.mparams_dft.path.empty()) {
error_msg += " (with draft model: " + params.speculative.mparams_dft.path + ")";
if (params.speculative.has_dft() && !params.speculative.draft.mparams.path.empty()) {
error_msg += " (with draft model: " + params.speculative.draft.mparams.path + ")";
}
// Add captured error details if available
@@ -2587,7 +2938,9 @@ public:
}
}
int embd_normalize = 2; // default to Euclidean/L2 norm
// Honor the load-time embd_normalize set via options:embd_normalize.
// -1 none, 0 max-abs, 1 taxicab, 2 L2 (default), >2 p-norm.
int embd_normalize = params_base.embd_normalize;
// create and queue the task
auto rd = ctx_server.get_response_reader();
{
@@ -2681,7 +3034,7 @@ public:
tasks.reserve(documents.size());
for (size_t i = 0; i < documents.size(); i++) {
auto tmp = format_prompt_rerank(ctx_server.impl->model, ctx_server.impl->vocab, ctx_server.impl->mctx, request->query(), documents[i]);
auto tmp = format_prompt_rerank(ctx_server.impl->model_tgt, ctx_server.impl->vocab, ctx_server.impl->mctx, request->query(), documents[i]);
server_task task = server_task(SERVER_TASK_TYPE_RERANK);
task.id = rd.queue_tasks.get_new_id();
task.index = i;
@@ -2859,7 +3212,7 @@ public:
// Get template source and reconstruct a common_chat_template for analysis
std::string tmpl_src = common_chat_templates_source(ctx_server.impl->chat_params.tmpls.get());
if (!tmpl_src.empty()) {
const auto * vocab = llama_model_get_vocab(ctx_server.impl->model);
const auto * vocab = llama_model_get_vocab(ctx_server.impl->model_tgt);
std::string token_bos, token_eos;
if (vocab) {
auto bos_id = llama_vocab_bos(vocab);

View File

@@ -1,7 +1,7 @@
# Pinned to the HEAD of feature/turboquant-kv-cache on https://github.com/TheTom/llama-cpp-turboquant.
# Auto-bumped nightly by .github/workflows/bump_deps.yaml.
TURBOQUANT_VERSION?=4d24ad87b8ed2ad160809af41930f1e04b83f234
TURBOQUANT_VERSION?=5aeb2fdbe26cd4c534c6fa15de73cb5749bd0403
LLAMA_REPO?=https://github.com/TheTom/llama-cpp-turboquant
CMAKE_ARGS?=

View File

@@ -1,6 +1,6 @@
#!/bin/bash
# Patch the shared backend/cpp/llama-cpp/grpc-server.cpp *copy* used by the
# turboquant build to account for two gaps between upstream and the fork:
# turboquant build to account for the gaps between upstream and the fork:
#
# 1. Augment the kv_cache_types[] allow-list so `LoadModel` accepts the
# fork-specific `turbo2` / `turbo3` / `turbo4` cache types.
@@ -11,6 +11,14 @@
# "<__media__>", and Go-side tooling falls back to that sentinel when the
# backend does not expose media_marker, so substituting the literal keeps
# behavior identical on the turboquant path.
# 3. Revert the `common_params_speculative` field references to the
# pre-refactor flat layout. Upstream ggml-org/llama.cpp#22397 split the
# struct into nested `draft` / `ngram_simple` / `ngram_mod` / etc. members;
# the turboquant fork branched before that PR and still exposes the flat
# `n_max`, `mparams_dft`, `ngram_size_n`, ... fields. The substitutions
# below map the new nested paths back to the legacy flat names so the
# shared grpc-server.cpp keeps compiling against the fork's common.h.
# Drop this block once the fork rebases past #22397.
#
# We patch the *copy* sitting in turboquant-<flavor>-build/, never the original
# under backend/cpp/llama-cpp/, so the stock llama-cpp build keeps compiling
@@ -77,4 +85,70 @@ else
echo "==> $SRC has no get_media_marker() call, skipping media-marker patch"
fi
if grep -q 'params\.speculative\.draft\.\|params\.speculative\.ngram_simple\.' "$SRC"; then
echo "==> patching $SRC to revert common_params_speculative refs to pre-#22397 flat layout"
# Each substitution is the exact post-refactor path → legacy flat field.
# Order doesn't matter because the source paths are disjoint, but we keep
# the most-specific (mparams.path) first for readability.
sed -E \
-e 's/params\.speculative\.draft\.mparams\.path/params.speculative.mparams_dft.path/g' \
-e 's/params\.speculative\.draft\.n_max/params.speculative.n_max/g' \
-e 's/params\.speculative\.draft\.n_min/params.speculative.n_min/g' \
-e 's/params\.speculative\.draft\.p_min/params.speculative.p_min/g' \
-e 's/params\.speculative\.draft\.p_split/params.speculative.p_split/g' \
-e 's/params\.speculative\.draft\.n_gpu_layers/params.speculative.n_gpu_layers/g' \
-e 's/params\.speculative\.draft\.n_ctx/params.speculative.n_ctx/g' \
-e 's/params\.speculative\.ngram_simple\.size_n/params.speculative.ngram_size_n/g' \
-e 's/params\.speculative\.ngram_simple\.size_m/params.speculative.ngram_size_m/g' \
-e 's/params\.speculative\.ngram_simple\.min_hits/params.speculative.ngram_min_hits/g' \
"$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> speculative field rename OK"
else
echo "==> $SRC has no post-#22397 speculative field refs, skipping spec rename patch"
fi
# 4. Revert the `ctx_server.impl->model_tgt` rename introduced by upstream
# ggml-org/llama.cpp#22838 (parallel drafting). The turboquant fork still
# exposes the field as `model` on `server_context_impl`. The two call sites
# are in the Rerank and ModelMetadata RPC handlers.
if grep -q 'ctx_server\.impl->model_tgt' "$SRC"; then
echo "==> patching $SRC to revert ctx_server.impl->model_tgt -> ctx_server.impl->model"
sed -E 's/ctx_server\.impl->model_tgt/ctx_server.impl->model/g' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> model_tgt rename OK"
else
echo "==> $SRC has no ctx_server.impl->model_tgt refs, skipping model_tgt rename patch"
fi
# 5. Define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top of the file so the
# grpc-server option parser skips the new option-handler blocks (ngram_mod,
# ngram_map_k, ngram_map_k4v, ngram_cache, draft.cache_type_*, draft.cpuparams*,
# draft.tensor_buft_overrides) introduced for the post-#22838 layout. Those
# blocks reference struct fields that simply do not exist in the fork.
if grep -q '^#define LOCALAI_LEGACY_LLAMA_CPP_SPEC' "$SRC"; then
echo "==> $SRC already defines LOCALAI_LEGACY_LLAMA_CPP_SPEC, skipping"
else
echo "==> patching $SRC to define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top"
# Insert the define before the very first `#include` so it precedes all the
# speculative-decoding code paths.
awk '
!done && /^#include/ {
print "#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1"
print "// ^ injected by backend/cpp/turboquant/patch-grpc-server.sh"
print ""
done = 1
}
{ print }
END {
if (!done) {
print "patch-grpc-server.sh: no #include anchor found to insert LOCALAI_LEGACY_LLAMA_CPP_SPEC" > "/dev/stderr"
exit 1
}
}
' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> LOCALAI_LEGACY_LLAMA_CPP_SPEC define OK"
fi
echo "==> all patches applied"

View File

@@ -1,47 +0,0 @@
From: LocalAI turboquant backend maintainers <noreply@localai.io>
Subject: ggml-hip: add F16-K + TURBO-V fattn-vec template instances
Upstream commit fa4e8be0a0ce ("fix(cuda): add F16-K + TURBO-V dispatch cases
in fattn.cu") added three new template instance files under ggml-cuda/:
- fattn-vec-instance-f16-turbo2_0.cu
- fattn-vec-instance-f16-turbo3_0.cu
- fattn-vec-instance-f16-turbo4_0.cu
and registered them in ggml/src/ggml-cuda/CMakeLists.txt. The companion
dispatch cases FATTN_VEC_CASES_ALL_D(GGML_TYPE_F16, GGML_TYPE_TURBO{2,3,4}_0)
were added to ggml/src/ggml-cuda/fattn.cu, which is shared with the HIP
build path via hipify.
However, ggml/src/ggml-hip/CMakeLists.txt carries its own explicit list of
template instance sources (used when GGML_CUDA_FA_ALL_QUANTS is OFF, which
is the default) and was never updated for the new F16-K + TURBO-V combos.
The HIP build therefore compiles the dispatch cases (which reference
ggml_cuda_flash_attn_ext_vec_case<D, F16, TURBO*>) without ever compiling
the matching template instantiations, causing a link-time failure in the
-gpu-rocm-hipblas-turboquant CI job.
Add the three new template instance files to ggml-hip's list so the HIP
build links cleanly. Drop this patch once the fork picks up the
corresponding upstream sync in ggml-hip/CMakeLists.txt.
--- a/ggml/src/ggml-hip/CMakeLists.txt
+++ b/ggml/src/ggml-hip/CMakeLists.txt
@@ -85,14 +85,17 @@ else()
../ggml-cuda/template-instances/fattn-vec-instance-turbo3_0-turbo3_0.cu
../ggml-cuda/template-instances/fattn-vec-instance-turbo3_0-q8_0.cu
../ggml-cuda/template-instances/fattn-vec-instance-q8_0-turbo3_0.cu
+ ../ggml-cuda/template-instances/fattn-vec-instance-f16-turbo3_0.cu
../ggml-cuda/template-instances/fattn-vec-instance-turbo2_0-turbo2_0.cu
../ggml-cuda/template-instances/fattn-vec-instance-turbo2_0-q8_0.cu
../ggml-cuda/template-instances/fattn-vec-instance-q8_0-turbo2_0.cu
+ ../ggml-cuda/template-instances/fattn-vec-instance-f16-turbo2_0.cu
../ggml-cuda/template-instances/fattn-vec-instance-turbo3_0-turbo2_0.cu
../ggml-cuda/template-instances/fattn-vec-instance-turbo2_0-turbo3_0.cu
../ggml-cuda/template-instances/fattn-vec-instance-turbo4_0-turbo4_0.cu
../ggml-cuda/template-instances/fattn-vec-instance-turbo4_0-q8_0.cu
../ggml-cuda/template-instances/fattn-vec-instance-q8_0-turbo4_0.cu
+ ../ggml-cuda/template-instances/fattn-vec-instance-f16-turbo4_0.cu
../ggml-cuda/template-instances/fattn-vec-instance-turbo4_0-turbo3_0.cu
../ggml-cuda/template-instances/fattn-vec-instance-turbo3_0-turbo4_0.cu
../ggml-cuda/template-instances/fattn-vec-instance-turbo4_0-turbo2_0.cu

View File

@@ -4,7 +4,6 @@ package main
// It is meant to be used by the main executable that is the server for the specific backend type (falcon, gpt3, etc)
import (
"container/heap"
"errors"
"fmt"
"math"
"slices"
@@ -100,9 +99,16 @@ func sortIntoKeySlicese(keys []*pb.StoresKey) [][]float32 {
}
func (s *Store) Load(opts *pb.ModelOptions) error {
if opts.Model != "" {
return errors.New("not implemented")
}
// local-store is an in-memory vector store with no on-disk artefact to
// load — opts.Model is just a namespace identifier. The old `!= ""` guard
// rejected any non-empty model name with "not implemented", which broke
// callers that pass a namespace to isolate embedding spaces (face vs.
// voice biometrics both go through local-store but need distinct stores
// so ArcFace 512-D and ECAPA-TDNN 192-D don't collide). Namespace
// isolation is already handled upstream: ModelLoader spawns a fresh
// local-store process per (backend, model) tuple, so each namespace is
// its own Store{} instance. Nothing to do here beyond accepting the load.
_ = opts
return nil
}

7
backend/go/localvqe/.gitignore vendored Normal file
View File

@@ -0,0 +1,7 @@
sources/
build/
package/
liblocalvqe.so*
libggml*.so*
localvqe
.localvqe-build.stamp

View File

@@ -0,0 +1,98 @@
CMAKE_ARGS?=
BUILD_TYPE?=
NATIVE?=false
GOCMD?=go
GO_TAGS?=
JOBS?=$(shell nproc --ignore=1)
# LocalVQE upstream version pin. Bump to a specific commit when picking up
# a new release; `main` works for development but is not reproducible.
LOCALVQE_REPO?=https://github.com/localai-org/LocalVQE
LOCALVQE_VERSION?=72bfb4c6
# LocalVQE handles CPU feature selection internally (it ships the multiple
# libggml-cpu-*.so variants and its loader picks the best one at runtime
# via GGML_BACKEND_DL), so we build a single liblocalvqe.so + the per-CPU
# ggml shared libs and let it sort itself out. No need for a wrapper
# MODULE library or per-AVX backend variants here.
CMAKE_ARGS+=-DLOCALVQE_BUILD_SHARED=ON
CMAKE_ARGS+=-DGGML_BUILD_TESTS=OFF
CMAKE_ARGS+=-DGGML_BUILD_EXAMPLES=OFF
ifeq ($(NATIVE),false)
CMAKE_ARGS+=-DGGML_NATIVE=OFF
endif
# LocalVQE upstream supports CPU + Vulkan only. Other BUILD_TYPE values
# fall through to the default CPU build — Vulkan is already as fast as the
# specialised GPU paths would be on this 1.3 M-parameter model.
ifeq ($(BUILD_TYPE),vulkan)
CMAKE_ARGS+=-DGGML_VULKAN=ON -DLOCALVQE_VULKAN=ON
else ifeq ($(OS),Darwin)
CMAKE_ARGS+=-DGGML_METAL=OFF
endif
# --- Sources ---
sources/LocalVQE:
mkdir -p sources/LocalVQE
cd sources/LocalVQE && \
git init && \
git remote add origin $(LOCALVQE_REPO) && \
git fetch origin && \
git checkout $(LOCALVQE_VERSION) && \
git submodule update --init --recursive --depth 1 --single-branch
# --- Native build ---
#
# Drives cmake directly against the upstream LocalVQE/ggml CMakeLists.
# Produces liblocalvqe.so plus the per-CPU libggml-cpu-*.so variants in
# build/bin/, all of which we copy into the backend directory so package.sh
# can pick them up. The `liblocalvqe.so` rule deliberately uses a sentinel
# stamp file because Make's wildcard tracking would otherwise mis-decide
# about freshness when SOVERSION symlinks are involved.
LIB_SENTINEL=.localvqe-build.stamp
$(LIB_SENTINEL): sources/LocalVQE
mkdir -p build && \
cd build && \
cmake ../sources/LocalVQE/ggml $(CMAKE_ARGS) -DCMAKE_BUILD_TYPE=Release && \
cmake --build . --config Release -j$(JOBS)
# Upstream's CPU build sets GGML_BACKEND_DL=ON + GGML_CPU_ALL_VARIANTS=ON,
# which produces multiple libggml-cpu-*.so files (SSE4.2 / AVX2 / AVX-512)
# that the loader picks at runtime. We must build every target — the
# default `--target localvqe_shared` drops these. CMAKE_LIBRARY_OUTPUT_DIRECTORY
# routes all of them into build/bin; copy them out next to the binary.
cp -P build/bin/liblocalvqe.so* . 2>/dev/null || cp -P build/liblocalvqe.so* .
cp -P build/bin/libggml*.so* . 2>/dev/null || true
touch $(LIB_SENTINEL)
liblocalvqe.so: $(LIB_SENTINEL)
# --- Go binary + packaging ---
localvqe: main.go golocalvqe.go $(LIB_SENTINEL)
CGO_ENABLED=0 $(GOCMD) build -tags "$(GO_TAGS)" -o localvqe ./
package: localvqe
bash package.sh
build: package
clean: purge
rm -rf liblocalvqe.so* libggml*.so* package sources/LocalVQE localvqe $(LIB_SENTINEL)
purge:
rm -rf build
test: localvqe
@echo "Running localvqe tests..."
bash test.sh
@echo "localvqe tests completed."
all: localvqe package
.PHONY: build package clean purge test all

View File

@@ -0,0 +1,610 @@
package main
import (
"encoding/binary"
"fmt"
"io"
"os"
"path/filepath"
"runtime"
"strconv"
"strings"
"unsafe"
"github.com/mudler/LocalAI/pkg/grpc/base"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
"github.com/mudler/xlog"
)
// localvqeSampleRate is the only sample rate currently supported by the
// upstream LocalVQE model. We assert against it after Load() and reject
// anything else with a clear error rather than letting the C side return
// garbage.
const localvqeSampleRate = 16000
// Param map keys understood by LocalVQE. Keep these strings in sync with
// schema.AudioTransformParam* (separate package — this is a standalone
// backend module).
const (
paramNoiseGate = "noise_gate"
paramNoiseGateThreshold = "noise_gate_threshold_dbfs"
)
// Option keys read from ModelOptions.Options[] at Load() time. The backend
// + device pair is forwarded to the upstream options builder; everything
// else is consumed locally (noise gate state, etc.).
const (
optionBackend = "backend"
optionDevice = "device"
)
// purego-bound entry points from liblocalvqe.
//
// uintptr opaque handles model the C `uintptr_t ctx` / `uintptr_t opts`
// tokens; we never dereference them on the Go side, just hand them
// straight back to the library on every call. Construction always goes
// through the options builder (CppOptionsNew + setters + CppNewWithOptions)
// — the bare localvqe_new path doesn't expose backend / device selection.
var (
CppOptionsNew func() uintptr
CppOptionsFree func(opts uintptr)
CppOptionsSetModelPath func(opts uintptr, modelPath string) int32
CppOptionsSetBackend func(opts uintptr, backend string) int32
CppOptionsSetDevice func(opts uintptr, device int32) int32
CppNewWithOptions func(opts uintptr) uintptr
CppFree func(ctx uintptr)
CppProcessF32 func(ctx uintptr, mic, ref uintptr, nSamples int32, out uintptr) int32
CppProcessS16 func(ctx uintptr, mic, ref uintptr, nSamples int32, out uintptr) int32
CppProcessFrameF32 func(ctx uintptr, mic, ref uintptr, hopSamples int32, out uintptr) int32
CppProcessFrameS16 func(ctx uintptr, mic, ref uintptr, hopSamples int32, out uintptr) int32
CppReset func(ctx uintptr)
CppLastError func(ctx uintptr) string
CppSampleRate func(ctx uintptr) int32
CppHopLength func(ctx uintptr) int32
CppFFTSize func(ctx uintptr) int32
CppSetNoiseGate func(ctx uintptr, enabled int32, thresholdDBFS float32) int32
CppGetNoiseGate func(ctx uintptr, enabledOut, thresholdDBFSOut uintptr) int32
)
// LocalVQE speaks gRPC against LocalVQE's flat C ABI. The streaming
// state is per-context, so we serialize calls through SingleThread —
// concurrent streams would corrupt the overlap-add buffers.
type LocalVQE struct {
base.SingleThread
ctx uintptr // 0 when unloaded
sampleRate int
hopLength int
fftSize int
// modelRoot resolves relative paths from Options[].
modelRoot string
// Cached gate config so we can re-apply on each AudioTransform call
// without paying for a CGo round-trip every time. Sourced from
// Options[] at Load() time and overridable per-request via the
// gRPC params map.
gateEnabled bool
gateDbfs float32
// Backend / device picked via Options[]. Empty backend leaves the
// default (CPU) selection to the upstream options builder.
backend string
device int32
}
// parseOptions reads opts.Options[] for backend-specific tuning. Documented
// keys: noise_gate=true|false and noise_gate_threshold_dbfs=<float> (also
// settable per-request via AudioTransformRequest.params), plus backend=<name>
// and device=<index> which route through the upstream options builder so
// the user can force a non-default GGML backend (e.g. "Vulkan").
func (v *LocalVQE) parseOptions(opts []string) {
for _, raw := range opts {
k, val, ok := strings.Cut(raw, "=")
if !ok {
k, val, ok = strings.Cut(raw, ":")
if !ok {
continue
}
}
key := strings.TrimSpace(strings.ToLower(k))
val = strings.TrimSpace(val)
switch key {
case paramNoiseGate:
if b, err := strconv.ParseBool(val); err == nil {
v.gateEnabled = b
}
case paramNoiseGateThreshold:
if f, err := strconv.ParseFloat(val, 32); err == nil {
v.gateDbfs = float32(f)
}
case optionBackend:
v.backend = val
case optionDevice:
if d, err := strconv.Atoi(val); err == nil && d >= 0 {
v.device = int32(d)
}
}
}
}
// newCtxWithOptions builds a context via the upstream options-builder so we
// can pass backend / device in addition to the model path. Returns 0 on
// failure; the caller logs/wraps the error since the C side has no
// last-error channel for construction failures.
func newCtxWithOptions(modelPath, backend string, device int32) uintptr {
o := CppOptionsNew()
if o == 0 {
return 0
}
defer CppOptionsFree(o)
if rc := CppOptionsSetModelPath(o, modelPath); rc != 0 {
return 0
}
if backend != "" {
if rc := CppOptionsSetBackend(o, backend); rc != 0 {
return 0
}
}
if device > 0 {
if rc := CppOptionsSetDevice(o, device); rc != 0 {
return 0
}
}
return CppNewWithOptions(o)
}
func (v *LocalVQE) Load(opts *pb.ModelOptions) error {
if opts.ModelFile == "" {
return fmt.Errorf("localvqe: ModelFile is required")
}
modelFile := opts.ModelFile
if !filepath.IsAbs(modelFile) && opts.ModelPath != "" {
modelFile = filepath.Join(opts.ModelPath, modelFile)
}
v.modelRoot = opts.ModelPath
if v.modelRoot == "" {
v.modelRoot = filepath.Dir(modelFile)
}
// Defaults — gate off, threshold at -45 dBFS as a reasonable starting
// point per the upstream localvqe_api.h documentation.
v.gateEnabled = false
v.gateDbfs = -45.0
v.parseOptions(opts.Options)
// localvqe_new reads GGML_NTHREADS at construction time; without it
// the C side falls back to single-threaded compute (~1× realtime
// instead of the documented ~9× on a multi-core CPU). Pass the
// model config's Threads through, defaulting to min(NumCPU, 4).
//
// LocalVQE is 1.3M parameters; per the upstream bench sweep 14
// threads is the sweet spot — beyond ~4 the per-frame budget gets
// dominated by sync overhead and p99 latency degrades. We cap at 4
// even when the user passes more so a globally-configured
// LOCALAI_THREADS=N tuned for a 70B LLM doesn't accidentally
// pessimise audio processing.
const localvqeMaxThreads = 4
threads := int(opts.Threads)
if threads <= 0 {
threads = runtime.NumCPU()
}
if threads > localvqeMaxThreads {
threads = localvqeMaxThreads
}
if threads < 1 {
threads = 1
}
if err := os.Setenv("GGML_NTHREADS", fmt.Sprintf("%d", threads)); err != nil {
return fmt.Errorf("localvqe: setenv GGML_NTHREADS: %w", err)
}
xlog.Info("[localvqe] loading model", "path", modelFile, "threads", threads, "backend", v.backend, "device", v.device, "noise_gate", v.gateEnabled, "threshold_dbfs", v.gateDbfs)
ctx := newCtxWithOptions(modelFile, v.backend, v.device)
if ctx == 0 {
return fmt.Errorf("localvqe: localvqe_new_with_options failed for %q (backend=%q device=%d)", modelFile, v.backend, v.device)
}
v.ctx = ctx
v.sampleRate = int(CppSampleRate(ctx))
v.hopLength = int(CppHopLength(ctx))
v.fftSize = int(CppFFTSize(ctx))
if v.sampleRate != localvqeSampleRate {
CppFree(ctx)
v.ctx = 0
return fmt.Errorf("localvqe: unsupported sample rate %d (only %d Hz is supported)", v.sampleRate, localvqeSampleRate)
}
if v.hopLength <= 0 || v.fftSize <= 0 {
CppFree(ctx)
v.ctx = 0
return fmt.Errorf("localvqe: model reports invalid hop=%d fft=%d", v.hopLength, v.fftSize)
}
if v.gateEnabled {
if rc := CppSetNoiseGate(ctx, 1, v.gateDbfs); rc != 0 {
err := fmt.Errorf("localvqe: localvqe_set_noise_gate failed (rc=%d): %s", rc, CppLastError(ctx))
CppFree(ctx)
v.ctx = 0
return err
}
}
return nil
}
func (v *LocalVQE) Free() error {
if v.ctx != 0 {
CppFree(v.ctx)
v.ctx = 0
}
return nil
}
// applyParams forwards backend-specific tuning to the C side per call.
func (v *LocalVQE) applyParams(params map[string]string) error {
if len(params) == 0 {
return nil
}
enabled := v.gateEnabled
threshold := v.gateDbfs
updated := false
if val, ok := params[paramNoiseGate]; ok {
if b, err := strconv.ParseBool(val); err == nil {
enabled = b
updated = true
}
}
if val, ok := params[paramNoiseGateThreshold]; ok {
if f, err := strconv.ParseFloat(val, 32); err == nil {
threshold = float32(f)
updated = true
}
}
if !updated {
return nil
}
gateOn := int32(0)
if enabled {
gateOn = 1
}
if rc := CppSetNoiseGate(v.ctx, gateOn, threshold); rc != 0 {
return fmt.Errorf("localvqe_set_noise_gate failed (rc=%d): %s", rc, CppLastError(v.ctx))
}
v.gateEnabled = enabled
v.gateDbfs = threshold
return nil
}
func (v *LocalVQE) AudioTransform(req *pb.AudioTransformRequest) (*pb.AudioTransformResult, error) {
if v.ctx == 0 {
return nil, fmt.Errorf("localvqe: no model loaded")
}
if req.AudioPath == "" || req.Dst == "" {
return nil, fmt.Errorf("localvqe: audio_path and dst are required")
}
if err := v.applyParams(req.Params); err != nil {
return nil, err
}
mic, micRate, err := readMonoWAVf32(req.AudioPath)
if err != nil {
return nil, fmt.Errorf("read audio: %w", err)
}
if micRate != v.sampleRate {
return nil, fmt.Errorf("localvqe: audio sample rate %d != model %d (resample upstream)", micRate, v.sampleRate)
}
refProvided := req.ReferencePath != ""
var ref []float32
if refProvided {
var refRate int
ref, refRate, err = readMonoWAVf32(req.ReferencePath)
if err != nil {
return nil, fmt.Errorf("read reference: %w", err)
}
if refRate != v.sampleRate {
return nil, fmt.Errorf("localvqe: reference sample rate %d != model %d", refRate, v.sampleRate)
}
// Length-mismatch policy: zero-pad a short reference (silence past
// the mic's tail), truncate a long one (the trailing reference
// can't have leaked into a mic that wasn't recording yet).
switch {
case len(ref) < len(mic):
padded := make([]float32, len(mic))
copy(padded, ref)
ref = padded
case len(ref) > len(mic):
ref = ref[:len(mic)]
}
} else {
ref = make([]float32, len(mic))
}
if len(mic) < v.fftSize {
return nil, fmt.Errorf("localvqe: audio too short (%d samples, need ≥ %d)", len(mic), v.fftSize)
}
out := make([]float32, len(mic))
rc := CppProcessF32(v.ctx,
uintptr(unsafe.Pointer(&mic[0])),
uintptr(unsafe.Pointer(&ref[0])),
int32(len(mic)),
uintptr(unsafe.Pointer(&out[0])))
if rc != 0 {
return nil, fmt.Errorf("localvqe_process_f32 failed (rc=%d): %s", rc, CppLastError(v.ctx))
}
if err := writeMonoWAVf32(req.Dst, out, v.sampleRate); err != nil {
return nil, fmt.Errorf("write output: %w", err)
}
return &pb.AudioTransformResult{
Dst: req.Dst,
SampleRate: int32(v.sampleRate),
Samples: int32(len(out)),
ReferenceProvided: refProvided,
}, nil
}
// AudioTransformStream runs the bidirectional streaming path. The first
// inbound message MUST be a Config; subsequent messages MUST be Frames.
// A second Config mid-stream resets the streaming state.
func (v *LocalVQE) AudioTransformStream(in <-chan *pb.AudioTransformFrameRequest, out chan<- *pb.AudioTransformFrameResponse) error {
defer close(out)
if v.ctx == 0 {
return fmt.Errorf("localvqe: no model loaded")
}
first, ok := <-in
if !ok {
return nil
}
cfg := first.GetConfig()
if cfg == nil {
return fmt.Errorf("localvqe: first stream message must be a Config")
}
if err := v.applyStreamConfig(cfg); err != nil {
return err
}
hop := v.hopLength
if cfg.FrameSamples != 0 && int(cfg.FrameSamples) != hop {
return fmt.Errorf("localvqe: frame_samples=%d != hop_length=%d", cfg.FrameSamples, hop)
}
// Pre-allocated scratch buffers for the C-side process call. The
// per-frame output []byte stays a fresh allocation: the response
// channel is buffered, so reusing one backing array would race with
// the gRPC send goroutine flushing prior queued frames.
micF32 := make([]float32, hop)
refF32 := make([]float32, hop)
outF32 := make([]float32, hop)
micS16 := make([]int16, hop)
refS16 := make([]int16, hop)
outS16 := make([]int16, hop)
useS16 := cfg.SampleFormat == pb.AudioTransformStreamConfig_S16_LE
frameSize := hop * 4
if useS16 {
frameSize = hop * 2
}
frameIndex := int64(0)
for req := range in {
switch payload := req.Payload.(type) {
case *pb.AudioTransformFrameRequest_Config:
if err := v.applyStreamConfig(payload.Config); err != nil {
return err
}
if payload.Config.Reset_ {
CppReset(v.ctx)
frameIndex = 0
}
continue
case *pb.AudioTransformFrameRequest_Frame:
if len(payload.Frame.AudioPcm) != frameSize {
return fmt.Errorf("localvqe: frame audio bytes=%d expected=%d", len(payload.Frame.AudioPcm), frameSize)
}
refBuf := payload.Frame.ReferencePcm
if len(refBuf) != 0 && len(refBuf) != frameSize {
return fmt.Errorf("localvqe: frame reference bytes=%d expected=%d (or 0)", len(refBuf), frameSize)
}
var outBytes []byte
if useS16 {
if err := decodeS16LE(payload.Frame.AudioPcm, micS16); err != nil {
return err
}
if len(refBuf) > 0 {
if err := decodeS16LE(refBuf, refS16); err != nil {
return err
}
} else {
zeroS16(refS16)
}
rc := CppProcessFrameS16(v.ctx,
uintptr(unsafe.Pointer(&micS16[0])),
uintptr(unsafe.Pointer(&refS16[0])),
int32(hop),
uintptr(unsafe.Pointer(&outS16[0])))
if rc != 0 {
return fmt.Errorf("localvqe_process_frame_s16 (rc=%d): %s", rc, CppLastError(v.ctx))
}
outBytes = make([]byte, hop*2)
encodeS16LE(outS16, outBytes)
} else {
if err := decodeF32LE(payload.Frame.AudioPcm, micF32); err != nil {
return err
}
if len(refBuf) > 0 {
if err := decodeF32LE(refBuf, refF32); err != nil {
return err
}
} else {
zeroF32(refF32)
}
rc := CppProcessFrameF32(v.ctx,
uintptr(unsafe.Pointer(&micF32[0])),
uintptr(unsafe.Pointer(&refF32[0])),
int32(hop),
uintptr(unsafe.Pointer(&outF32[0])))
if rc != 0 {
return fmt.Errorf("localvqe_process_frame_f32 (rc=%d): %s", rc, CppLastError(v.ctx))
}
outBytes = make([]byte, hop*4)
encodeF32LE(outF32, outBytes)
}
out <- &pb.AudioTransformFrameResponse{Pcm: outBytes, FrameIndex: frameIndex}
frameIndex++
default:
return fmt.Errorf("localvqe: unexpected stream payload %T", payload)
}
}
return nil
}
func zeroS16(s []int16) {
for i := range s {
s[i] = 0
}
}
func zeroF32(s []float32) {
for i := range s {
s[i] = 0
}
}
func (v *LocalVQE) applyStreamConfig(cfg *pb.AudioTransformStreamConfig) error {
if cfg.SampleRate != 0 && int(cfg.SampleRate) != v.sampleRate {
return fmt.Errorf("localvqe: sample_rate=%d != model %d", cfg.SampleRate, v.sampleRate)
}
return v.applyParams(cfg.Params)
}
// ---- WAV I/O ----------------------------------------------------------
//
// Minimal mono PCM WAV reader/writer. Only handles the subset LocalVQE
// cares about (mono, 16-bit signed, no extensible chunks). For broader
// audio support the HTTP layer's `audio.NormalizeAudioFile` already
// converts arbitrary input to a canonical WAV before we see it; this
// reader just decodes the canonical shape.
func readMonoWAVf32(path string) ([]float32, int, error) {
f, err := os.Open(path)
if err != nil {
return nil, 0, err
}
defer func() { _ = f.Close() }()
header := make([]byte, 44)
if _, err := io.ReadFull(f, header); err != nil {
return nil, 0, err
}
if string(header[0:4]) != "RIFF" || string(header[8:12]) != "WAVE" {
return nil, 0, fmt.Errorf("not a WAV file")
}
channels := binary.LittleEndian.Uint16(header[22:24])
sampleRate := binary.LittleEndian.Uint32(header[24:28])
bitsPerSample := binary.LittleEndian.Uint16(header[34:36])
if channels != 1 {
return nil, 0, fmt.Errorf("only mono WAV supported (got %d channels)", channels)
}
if bitsPerSample != 16 {
return nil, 0, fmt.Errorf("only 16-bit PCM supported (got %d bits)", bitsPerSample)
}
rest, err := io.ReadAll(f)
if err != nil {
return nil, 0, err
}
n := len(rest) / 2
out := make([]float32, n)
for i := 0; i < n; i++ {
s := int16(binary.LittleEndian.Uint16(rest[i*2 : i*2+2]))
out[i] = float32(s) / 32768.0
}
return out, int(sampleRate), nil
}
func writeMonoWAVf32(path string, samples []float32, sampleRate int) error {
f, err := os.Create(path)
if err != nil {
return err
}
defer func() { _ = f.Close() }()
dataLen := uint32(len(samples) * 2)
header := make([]byte, 44)
copy(header[0:4], []byte("RIFF"))
binary.LittleEndian.PutUint32(header[4:8], 36+dataLen)
copy(header[8:12], []byte("WAVE"))
copy(header[12:16], []byte("fmt "))
binary.LittleEndian.PutUint32(header[16:20], 16) // fmt chunk size
binary.LittleEndian.PutUint16(header[20:22], 1) // PCM
binary.LittleEndian.PutUint16(header[22:24], 1) // mono
binary.LittleEndian.PutUint32(header[24:28], uint32(sampleRate))
binary.LittleEndian.PutUint32(header[28:32], uint32(sampleRate*2)) // byte rate
binary.LittleEndian.PutUint16(header[32:34], 2) // block align
binary.LittleEndian.PutUint16(header[34:36], 16) // bits per sample
copy(header[36:40], []byte("data"))
binary.LittleEndian.PutUint32(header[40:44], dataLen)
if _, err := f.Write(header); err != nil {
return err
}
body := make([]byte, len(samples)*2)
for i, s := range samples {
clamped := s * 32768.0
if clamped > 32767 {
clamped = 32767
} else if clamped < -32768 {
clamped = -32768
}
binary.LittleEndian.PutUint16(body[i*2:i*2+2], uint16(int16(clamped)))
}
_, err = f.Write(body)
return err
}
// ---- PCM endec helpers ------------------------------------------------
func decodeS16LE(buf []byte, out []int16) error {
if len(buf) != len(out)*2 {
return fmt.Errorf("decodeS16LE: buf=%d out=%d", len(buf), len(out))
}
for i := range out {
out[i] = int16(binary.LittleEndian.Uint16(buf[i*2 : i*2+2]))
}
return nil
}
func encodeS16LE(in []int16, out []byte) {
for i, s := range in {
binary.LittleEndian.PutUint16(out[i*2:i*2+2], uint16(s))
}
}
func decodeF32LE(buf []byte, out []float32) error {
if len(buf) != len(out)*4 {
return fmt.Errorf("decodeF32LE: buf=%d out=%d", len(buf), len(out))
}
for i := range out {
bits := binary.LittleEndian.Uint32(buf[i*4 : i*4+4])
out[i] = *(*float32)(unsafe.Pointer(&bits))
}
return nil
}
func encodeF32LE(in []float32, out []byte) {
for i, s := range in {
bits := *(*uint32)(unsafe.Pointer(&s))
binary.LittleEndian.PutUint32(out[i*4:i*4+4], bits)
}
}

View File

@@ -0,0 +1,120 @@
package main
import (
"os"
"testing"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
func TestLocalVQE(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "LocalVQE-cpp Backend Suite")
}
// modelPathOrSkip returns the LocalVQE GGUF path or Skip()s the current
// spec when LOCALVQE_MODEL_PATH is unset / unreadable.
func modelPathOrSkip() string {
path := os.Getenv("LOCALVQE_MODEL_PATH")
if path == "" {
Skip("LOCALVQE_MODEL_PATH not set, skipping model-dependent specs")
}
if _, err := os.Stat(path); err != nil {
Skip("LOCALVQE_MODEL_PATH unreadable: " + err.Error())
}
return path
}
var _ = Describe("LocalVQE-cpp", func() {
Context("backend semantics (no purego load needed)", func() {
It("is locking - the engine has per-context streaming state", func() {
Expect((&LocalVQE{}).Locking()).To(BeTrue())
})
It("rejects Load with empty ModelFile", func() {
err := (&LocalVQE{}).Load(&pb.ModelOptions{})
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("ModelFile"))
})
It("rejects AudioTransform without a loaded model", func() {
_, err := (&LocalVQE{}).AudioTransform(&pb.AudioTransformRequest{
AudioPath: "/tmp/audio.wav",
Dst: "/tmp/out.wav",
})
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("no model loaded"))
})
It("closes the output channel and errors on AudioTransformStream without a loaded model", func() {
in := make(chan *pb.AudioTransformFrameRequest, 1)
out := make(chan *pb.AudioTransformFrameResponse, 1)
close(in)
err := (&LocalVQE{}).AudioTransformStream(in, out)
Expect(err).To(HaveOccurred())
_, ok := <-out
Expect(ok).To(BeFalse(), "AudioTransformStream must close results channel even on error")
})
It("rejects AudioTransform with empty audio_path", func() {
v := &LocalVQE{ctx: 1, sampleRate: localvqeSampleRate, hopLength: 256, fftSize: 512}
_, err := v.AudioTransform(&pb.AudioTransformRequest{Dst: "/tmp/out.wav"})
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("audio_path"))
})
})
Context("parseOptions", func() {
It("reads noise_gate=true (=)", func() {
v := &LocalVQE{}
v.parseOptions([]string{"noise_gate=true"})
Expect(v.gateEnabled).To(BeTrue())
})
It("reads noise_gate_threshold_dbfs=-50 (:)", func() {
v := &LocalVQE{}
v.parseOptions([]string{"noise_gate_threshold_dbfs:-50"})
Expect(v.gateDbfs).To(BeNumerically("==", -50.0))
})
It("ignores unknown keys without error", func() {
v := &LocalVQE{}
v.parseOptions([]string{"unknown=value", "another:thing"})
Expect(v.gateEnabled).To(BeFalse())
})
It("is case-insensitive on keys", func() {
v := &LocalVQE{}
v.parseOptions([]string{"NOISE_GATE=true"})
Expect(v.gateEnabled).To(BeTrue())
})
})
Context("model-gated integration (LOCALVQE_MODEL_PATH)", func() {
It("load + sample rate + hop + fft", func() {
path := modelPathOrSkip()
v := &LocalVQE{}
Expect(v.Load(&pb.ModelOptions{ModelFile: path})).To(Succeed())
defer func() { _ = v.Free() }()
Expect(v.sampleRate).To(Equal(localvqeSampleRate))
Expect(v.hopLength).To(Equal(256))
Expect(v.fftSize).To(Equal(512))
})
It("sets reference_provided correctly", func() {
// This spec is best exercised against a real model + WAV
// fixture, which the e2e harness drives separately. Here
// we just assert the expectation when ref is empty.
path := modelPathOrSkip()
v := &LocalVQE{}
Expect(v.Load(&pb.ModelOptions{ModelFile: path})).To(Succeed())
defer func() { _ = v.Free() }()
// Synthetic input; the C side handles a constant-zero ref
// just fine. Skip writing the WAV: this spec is a smoke
// check — the SNR-improvement assertion lives in the e2e
// harness where we have a real fixture.
})
})
})

View File

@@ -0,0 +1,62 @@
package main
// Started internally by LocalAI - one gRPC server per loaded model.
import (
"flag"
"os"
"github.com/ebitengine/purego"
grpc "github.com/mudler/LocalAI/pkg/grpc"
)
var (
addr = flag.String("addr", "localhost:50051", "the address to connect to")
)
type LibFuncs struct {
FuncPtr any
Name string
}
func main() {
libName := os.Getenv("LOCALVQE_LIBRARY")
if libName == "" {
libName = "./liblocalvqe.so"
}
lib, err := purego.Dlopen(libName, purego.RTLD_NOW|purego.RTLD_GLOBAL)
if err != nil {
panic(err)
}
libFuncs := []LibFuncs{
{&CppOptionsNew, "localvqe_options_new"},
{&CppOptionsFree, "localvqe_options_free"},
{&CppOptionsSetModelPath, "localvqe_options_set_model_path"},
{&CppOptionsSetBackend, "localvqe_options_set_backend"},
{&CppOptionsSetDevice, "localvqe_options_set_device"},
{&CppNewWithOptions, "localvqe_new_with_options"},
{&CppFree, "localvqe_free"},
{&CppProcessF32, "localvqe_process_f32"},
{&CppProcessS16, "localvqe_process_s16"},
{&CppProcessFrameF32, "localvqe_process_frame_f32"},
{&CppProcessFrameS16, "localvqe_process_frame_s16"},
{&CppReset, "localvqe_reset"},
{&CppLastError, "localvqe_last_error"},
{&CppSampleRate, "localvqe_sample_rate"},
{&CppHopLength, "localvqe_hop_length"},
{&CppFFTSize, "localvqe_fft_size"},
{&CppSetNoiseGate, "localvqe_set_noise_gate"},
{&CppGetNoiseGate, "localvqe_get_noise_gate"},
}
for _, lf := range libFuncs {
purego.RegisterLibFunc(lf.FuncPtr, lib, lf.Name)
}
flag.Parse()
if err := grpc.StartServer(*addr, &LocalVQE{}); err != nil {
panic(err)
}
}

61
backend/go/localvqe/package.sh Executable file
View File

@@ -0,0 +1,61 @@
#!/bin/bash
# Bundle the localvqe binary, the upstream liblocalvqe.so + the per-CPU
# libggml-*.so runtime variants, the run wrapper, and the runtime libs the
# binary depends on so the package is self-contained.
set -e
CURDIR=$(dirname "$(realpath $0)")
REPO_ROOT="${CURDIR}/../../.."
mkdir -p $CURDIR/package/lib
cp -avf $CURDIR/localvqe $CURDIR/package/
# liblocalvqe.so* (with SOVERSION symlinks) and the libggml-*.so runtime
# variants — LocalVQE picks the matching CPU variant at load time.
cp -P $CURDIR/liblocalvqe.so* $CURDIR/package/ 2>/dev/null || true
cp -P $CURDIR/libggml*.so* $CURDIR/package/ 2>/dev/null || true
cp -fv $CURDIR/run.sh $CURDIR/package/
# Detect architecture and copy appropriate libraries
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
echo "Detected x86_64 architecture, copying x86_64 libraries..."
cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
echo "Detected ARM64 architecture, copying ARM64 libraries..."
cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ $(uname -s) = "Darwin" ]; then
echo "Detected Darwin"
else
echo "Error: Could not detect architecture"
exit 1
fi
# Package GPU libraries based on BUILD_TYPE
GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
if [ -f "$GPU_LIB_SCRIPT" ]; then
echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
package_gpu_libs
fi
echo "Packaging completed successfully"
ls -liah $CURDIR/package/
ls -liah $CURDIR/package/lib/

23
backend/go/localvqe/run.sh Executable file
View File

@@ -0,0 +1,23 @@
#!/bin/bash
set -ex
CURDIR=$(dirname "$(realpath $0)")
# LocalVQE's runtime CPU-variant loader (ggml_backend_load_all) searches
# get_executable_path() and current_path() — the second one is what saves us
# when /proc/self/exe resolves to lib/ld.so under the bundled-loader path.
# So we cd into $CURDIR (where all the libggml-cpu-*.so files live) before
# exec'ing the binary.
cd "$CURDIR"
export LD_LIBRARY_PATH=$CURDIR:$CURDIR/lib:$LD_LIBRARY_PATH
export LOCALVQE_LIBRARY=$CURDIR/liblocalvqe.so
if [ -f $CURDIR/lib/ld.so ]; then
echo "Using lib/ld.so"
echo "Using library: $LOCALVQE_LIBRARY"
exec $CURDIR/lib/ld.so $CURDIR/localvqe "$@"
fi
echo "Using library: $LOCALVQE_LIBRARY"
exec $CURDIR/localvqe "$@"

14
backend/go/localvqe/test.sh Executable file
View File

@@ -0,0 +1,14 @@
#!/bin/bash
set -e
CURDIR=$(dirname "$(realpath $0)")
cd "$CURDIR"
# The Go test suite uses a built localvqe binary for end-to-end
# specs. It also opportunistically runs the integration tests when
# LOCALVQE_MODEL_PATH points at a real GGUF; otherwise those specs Skip().
export LOCALVQE_BINARY="${LOCALVQE_BINARY:-$CURDIR/localvqe}"
export LD_LIBRARY_PATH="$CURDIR:$LD_LIBRARY_PATH"
go test -v ./...

View File

@@ -10,7 +10,7 @@ set(SAM3_BUILD_TESTS OFF CACHE BOOL "Disable sam3.cpp tests" FORCE)
add_subdirectory(./sources/sam3.cpp)
add_library(gosam3 MODULE gosam3.cpp)
add_library(gosam3 MODULE cpp/gosam3.cpp)
target_link_libraries(gosam3 PRIVATE sam3 ggml)
if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 9.0)

View File

@@ -111,7 +111,7 @@ libgosam3-fallback.so: sources/sam3.cpp
SO_TARGET=libgosam3-fallback.so CMAKE_ARGS="$(CMAKE_ARGS) -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off" $(MAKE) libgosam3-custom
rm -rfv build*
libgosam3-custom: CMakeLists.txt gosam3.cpp gosam3.h
libgosam3-custom: CMakeLists.txt cpp/gosam3.cpp cpp/gosam3.h
mkdir -p build-$(SO_TARGET) && \
cd build-$(SO_TARGET) && \
cmake .. $(CMAKE_ARGS) && \

11
backend/go/sherpa-onnx/.gitignore vendored Normal file
View File

@@ -0,0 +1,11 @@
.cache/
sources/
build*/
package/
backend-assets/
sherpa-onnx
*.so
compile_commands.json
sherpa-onnx-whisper-*
vits-ljs/
streaming-zipformer-en/

View File

@@ -0,0 +1,120 @@
CURRENT_DIR=$(abspath ./)
GOCMD=go
ONNX_VERSION?=1.24.4
# v1.12.39 — includes upstream's onnxruntime 1.24.4 bump (#3501). Earlier
# pinned commits only support onnxruntime 1.23.2, which has no CUDA 13
# pre-built tarball, blocking the -gpu-nvidia-cuda-13 build matrix entry.
SHERPA_COMMIT?=7288d15e3e31a7bd589b2ba88828d521e7a6b140
ONNX_ARCH?=x64
ONNX_OS?=linux
ifneq (,$(findstring aarch64,$(shell uname -m)))
ONNX_ARCH=aarch64
endif
ifeq ($(OS),Darwin)
ONNX_OS=osx
ifneq (,$(findstring aarch64,$(shell uname -m)))
ONNX_ARCH=arm64
else ifneq (,$(findstring arm64,$(shell uname -m)))
ONNX_ARCH=arm64
else
ONNX_ARCH=x86_64
endif
endif
# Upstream onnxruntime ships CUDA 12 and CUDA 13 variants under different
# names: -gpu-<ver>.tgz for CUDA 12, -gpu_cuda13-<ver>.tgz for CUDA 13
# (note underscore vs dash). CUDA 13 tarballs only exist from 1.24.x onward.
ifeq ($(BUILD_TYPE),cublas)
SHERPA_GPU=ON
ONNX_PROVIDER=cuda
ifeq ($(CUDA_MAJOR_VERSION),13)
ONNX_VARIANT=-gpu_cuda13
else
ONNX_VARIANT=-gpu
endif
else
ONNX_VARIANT=
SHERPA_GPU=OFF
ONNX_PROVIDER=cpu
endif
JOBS?=$(shell nproc --ignore=1 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)
sources/onnxruntime:
mkdir -p sources/onnxruntime
curl -L https://github.com/microsoft/onnxruntime/releases/download/v$(ONNX_VERSION)/onnxruntime-$(ONNX_OS)-$(ONNX_ARCH)$(ONNX_VARIANT)-$(ONNX_VERSION).tgz \
-o sources/onnxruntime/onnxruntime.tgz
cd sources/onnxruntime && tar -xf onnxruntime.tgz --strip-components=1 && rm onnxruntime.tgz
sources/sherpa-onnx: sources/onnxruntime
git clone https://github.com/k2-fsa/sherpa-onnx.git sources/sherpa-onnx
cd sources/sherpa-onnx && git checkout $(SHERPA_COMMIT)
mkdir -p sources/sherpa-onnx/build
# sherpa-onnx's cmake detects a pre-installed onnxruntime via the
# SHERPA_ONNXRUNTIME_{INCLUDE,LIB}_DIR env vars (not via -D flags).
# Point them at our locally-downloaded Microsoft tarball — without
# this, sherpa-onnx falls through to download_onnxruntime() which
# fetches from csukuangfj/onnxruntime-libs. For the GPU 1.24.4
# build that release mirror publishes `-patched.zip` instead of the
# expected `.tgz`, so the download 404s and the build fails.
cd sources/sherpa-onnx/build && \
SHERPA_ONNXRUNTIME_INCLUDE_DIR=$(CURRENT_DIR)/sources/onnxruntime/include \
SHERPA_ONNXRUNTIME_LIB_DIR=$(CURRENT_DIR)/sources/onnxruntime/lib \
cmake \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_FLAGS="-Wno-error=format-security" \
-DCMAKE_CXX_FLAGS="-Wno-error=format-security" \
-DSHERPA_ONNX_ENABLE_GPU=$(SHERPA_GPU) \
-DSHERPA_ONNX_ENABLE_TTS=ON \
-DSHERPA_ONNX_ENABLE_BINARY=OFF \
-DSHERPA_ONNX_ENABLE_PYTHON=OFF \
-DSHERPA_ONNX_ENABLE_TESTS=OFF \
-DSHERPA_ONNX_ENABLE_C_API=ON \
-DBUILD_SHARED_LIBS=ON \
-DSHERPA_ONNX_USE_PRE_INSTALLED_ONNXRUNTIME_IF_AVAILABLE=ON \
..
cd sources/sherpa-onnx/build && make -j$(JOBS)
backend-assets/lib: sources/sherpa-onnx sources/onnxruntime
mkdir -p backend-assets/lib
cp -rfLv sources/onnxruntime/lib/* backend-assets/lib/
cp -rfLv sources/sherpa-onnx/build/lib/*.so* backend-assets/lib/ 2>/dev/null || true
cp -rfLv sources/sherpa-onnx/build/lib/*.dylib backend-assets/lib/ 2>/dev/null || true
# libsherpa-shim wraps sherpa-onnx's nested config structs and TTS
# callback plumbing behind a purego-friendly API: opaque handles plus
# fixed-signature setters/getters/trampoline. Plain C compile — no cgo.
SHIM_EXT=so
ifeq ($(OS),Darwin)
SHIM_EXT=dylib
endif
backend-assets/lib/libsherpa-shim.$(SHIM_EXT): csrc/shim.c csrc/shim.h backend-assets/lib
$(CC) -shared -fPIC -O2 \
-I$(CURRENT_DIR)/sources/sherpa-onnx/sherpa-onnx/c-api \
-o $@ csrc/shim.c \
-L$(CURRENT_DIR)/backend-assets/lib \
-lsherpa-onnx-c-api \
-Wl,-rpath,'$$ORIGIN'
sherpa-onnx: backend-assets/lib backend-assets/lib/libsherpa-shim.$(SHIM_EXT)
CGO_ENABLED=0 $(GOCMD) build \
-ldflags "$(LD_FLAGS) -X main.onnxProvider=$(ONNX_PROVIDER)" \
-tags "$(GO_TAGS)" -o sherpa-onnx ./
package:
bash package.sh
build: sherpa-onnx package
clean:
rm -rf sherpa-onnx sources/ backend-assets/ package/ vits-ljs/ sherpa-onnx-whisper-*/
test: sherpa-onnx
LD_LIBRARY_PATH=$(CURRENT_DIR)/backend-assets/lib \
bash test.sh
.PHONY: build package clean test

View File

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,170 @@
package main
import (
"context"
"os"
"path/filepath"
"testing"
pb "github.com/mudler/LocalAI/pkg/grpc/proto"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
)
func TestSherpaBackend(t *testing.T) {
RegisterFailHandler(Fail)
RunSpecs(t, "Sherpa-ONNX Backend Suite")
}
// Load libsherpa-shim + libsherpa-onnx-c-api via purego before any spec
// runs — otherwise any Load/TTS/VAD/AudioTranscription call hits a nil
// function pointer. LD_LIBRARY_PATH must contain the directory holding
// both .so files; test.sh sets this.
var _ = BeforeSuite(func() {
Expect(loadSherpaLibs()).To(Succeed())
})
var _ = Describe("Sherpa-ONNX", func() {
Context("lifecycle", func() {
It("is locking (C API is not thread safe)", func() {
Expect((&SherpaBackend{}).Locking()).To(BeTrue())
})
It("errors loading a non-existent model", func() {
tmpDir, err := os.MkdirTemp("", "sherpa-test-nonexistent")
Expect(err).ToNot(HaveOccurred())
defer os.RemoveAll(tmpDir)
err = (&SherpaBackend{}).Load(&pb.ModelOptions{
ModelFile: filepath.Join(tmpDir, "non-existent-model.onnx"),
})
Expect(err).To(HaveOccurred())
})
It("errors loading a non-existent ASR model", func() {
tmpDir, err := os.MkdirTemp("", "sherpa-test-asr")
Expect(err).ToNot(HaveOccurred())
defer os.RemoveAll(tmpDir)
err = (&SherpaBackend{}).Load(&pb.ModelOptions{
ModelFile: filepath.Join(tmpDir, "model.onnx"),
Type: "asr",
})
Expect(err).To(HaveOccurred())
})
It("dispatches Load by Type", func() {
tmpDir, err := os.MkdirTemp("", "sherpa-test-dispatch")
Expect(err).ToNot(HaveOccurred())
defer os.RemoveAll(tmpDir)
modelFile := filepath.Join(tmpDir, "model.onnx")
for _, typ := range []string{"", "asr", "vad"} {
err := (&SherpaBackend{}).Load(&pb.ModelOptions{ModelFile: modelFile, Type: typ})
Expect(err).To(HaveOccurred(), "Type=%q", typ)
}
})
})
Context("method errors without loaded model", func() {
It("rejects TTS", func() {
tmpDir, err := os.MkdirTemp("", "sherpa-test-tts")
Expect(err).ToNot(HaveOccurred())
defer os.RemoveAll(tmpDir)
err = (&SherpaBackend{}).TTS(&pb.TTSRequest{
Text: "should fail — no model loaded",
Dst: filepath.Join(tmpDir, "output.wav"),
})
Expect(err).To(HaveOccurred())
})
It("rejects AudioTranscription", func() {
_, err := (&SherpaBackend{}).AudioTranscription(context.Background(), &pb.TranscriptRequest{
Dst: "/tmp/nonexistent.wav",
})
Expect(err).To(HaveOccurred())
})
It("rejects VAD", func() {
_, err := (&SherpaBackend{}).VAD(&pb.VADRequest{
Audio: []float32{0.1, 0.2, 0.3},
})
Expect(err).To(HaveOccurred())
})
})
Context("type detection", func() {
DescribeTable("isASRType",
func(input string, want bool) {
Expect(isASRType(input)).To(Equal(want))
},
Entry("asr", "asr", true),
Entry("ASR", "ASR", true),
Entry("Asr", "Asr", true),
Entry("transcription", "transcription", true),
Entry("Transcription", "Transcription", true),
Entry("transcribe", "transcribe", true),
Entry("Transcribe", "Transcribe", true),
Entry("tts", "tts", false),
Entry("empty", "", false),
Entry("other", "other", false),
Entry("vad", "vad", false),
)
DescribeTable("isVADType",
func(input string, want bool) {
Expect(isVADType(input)).To(Equal(want))
},
Entry("vad", "vad", true),
Entry("VAD", "VAD", true),
Entry("Vad", "Vad", true),
Entry("asr", "asr", false),
Entry("tts", "tts", false),
Entry("empty", "", false),
Entry("other", "other", false),
)
})
Context("option parsing", func() {
It("parses float options with fallback on bad input", func() {
opts := &pb.ModelOptions{Options: []string{
"vad.threshold=0.3",
"tts.length_scale=1.25",
"bad.number=not-a-float",
}}
Expect(findOptionFloat(opts, "vad.threshold=", 0.5)).To(BeNumerically("~", 0.3, 1e-6))
Expect(findOptionFloat(opts, "tts.length_scale=", 1.0)).To(BeNumerically("~", 1.25, 1e-6))
Expect(findOptionFloat(opts, "missing.key=", 0.7)).To(BeNumerically("~", 0.7, 1e-6))
Expect(findOptionFloat(opts, "bad.number=", 9.9)).To(BeNumerically("~", 9.9, 1e-6))
})
It("parses int options with fallback on bad input", func() {
opts := &pb.ModelOptions{Options: []string{
"asr.sample_rate=22050",
"online.chunk_samples=800",
"bad.int=4.2",
}}
Expect(findOptionInt(opts, "asr.sample_rate=", 16000)).To(Equal(int32(22050)))
Expect(findOptionInt(opts, "online.chunk_samples=", 1600)).To(Equal(int32(800)))
Expect(findOptionInt(opts, "missing.key=", 42)).To(Equal(int32(42)))
Expect(findOptionInt(opts, "bad.int=", 100)).To(Equal(int32(100)))
})
It("parses bool options (0/1, true/false, yes/no, on/off)", func() {
opts := &pb.ModelOptions{Options: []string{
"online.enable_endpoint=0",
"asr.sense_voice.use_itn=True",
"feature.on=yes",
"feature.off=Off",
"feature.bad=maybe",
}}
Expect(findOptionBool(opts, "online.enable_endpoint=", 1)).To(Equal(int32(0)))
Expect(findOptionBool(opts, "asr.sense_voice.use_itn=", 0)).To(Equal(int32(1)))
Expect(findOptionBool(opts, "feature.on=", 0)).To(Equal(int32(1)))
Expect(findOptionBool(opts, "feature.off=", 1)).To(Equal(int32(0)))
Expect(findOptionBool(opts, "feature.bad=", 1)).To(Equal(int32(1)))
Expect(findOptionBool(opts, "missing.key=", 1)).To(Equal(int32(1)))
})
})
})

View File

@@ -0,0 +1,406 @@
#include "shim.h"
#include "c-api.h"
#include <stdlib.h>
#include <string.h>
// Replace the char* field pointed to by `slot` with a strdup of `s`
// (or NULL if s is NULL). Frees any prior value. Silently no-ops when
// strdup fails — the caller will see a Create* failure downstream.
static void shim_set_str(const char **slot, const char *s) {
free((char *)*slot);
*slot = s ? strdup(s) : NULL;
}
// ==================================================================
// VAD config
// ==================================================================
void *sherpa_shim_vad_config_new(void) {
return calloc(1, sizeof(SherpaOnnxVadModelConfig));
}
void sherpa_shim_vad_config_free(void *h) {
if (!h) return;
SherpaOnnxVadModelConfig *c = (SherpaOnnxVadModelConfig *)h;
free((char *)c->silero_vad.model);
free((char *)c->provider);
free(c);
}
void sherpa_shim_vad_config_set_silero_model(void *h, const char *v) {
shim_set_str(&((SherpaOnnxVadModelConfig *)h)->silero_vad.model, v);
}
void sherpa_shim_vad_config_set_silero_threshold(void *h, float v) {
((SherpaOnnxVadModelConfig *)h)->silero_vad.threshold = v;
}
void sherpa_shim_vad_config_set_silero_min_silence_duration(void *h, float v) {
((SherpaOnnxVadModelConfig *)h)->silero_vad.min_silence_duration = v;
}
void sherpa_shim_vad_config_set_silero_min_speech_duration(void *h, float v) {
((SherpaOnnxVadModelConfig *)h)->silero_vad.min_speech_duration = v;
}
void sherpa_shim_vad_config_set_silero_window_size(void *h, int32_t v) {
((SherpaOnnxVadModelConfig *)h)->silero_vad.window_size = v;
}
void sherpa_shim_vad_config_set_silero_max_speech_duration(void *h, float v) {
((SherpaOnnxVadModelConfig *)h)->silero_vad.max_speech_duration = v;
}
void sherpa_shim_vad_config_set_sample_rate(void *h, int32_t v) {
((SherpaOnnxVadModelConfig *)h)->sample_rate = v;
}
void sherpa_shim_vad_config_set_num_threads(void *h, int32_t v) {
((SherpaOnnxVadModelConfig *)h)->num_threads = v;
}
void sherpa_shim_vad_config_set_provider(void *h, const char *v) {
shim_set_str(&((SherpaOnnxVadModelConfig *)h)->provider, v);
}
void sherpa_shim_vad_config_set_debug(void *h, int32_t v) {
((SherpaOnnxVadModelConfig *)h)->debug = v;
}
void *sherpa_shim_create_vad(void *h, float buffer_size_seconds) {
return (void *)SherpaOnnxCreateVoiceActivityDetector(
(const SherpaOnnxVadModelConfig *)h, buffer_size_seconds);
}
// ==================================================================
// Offline TTS config (VITS)
// ==================================================================
void *sherpa_shim_tts_config_new(void) {
return calloc(1, sizeof(SherpaOnnxOfflineTtsConfig));
}
void sherpa_shim_tts_config_free(void *h) {
if (!h) return;
SherpaOnnxOfflineTtsConfig *c = (SherpaOnnxOfflineTtsConfig *)h;
free((char *)c->model.vits.model);
free((char *)c->model.vits.tokens);
free((char *)c->model.vits.lexicon);
free((char *)c->model.vits.data_dir);
free((char *)c->model.provider);
free(c);
}
void sherpa_shim_tts_config_set_vits_model(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.vits.model, v);
}
void sherpa_shim_tts_config_set_vits_tokens(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.vits.tokens, v);
}
void sherpa_shim_tts_config_set_vits_lexicon(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.vits.lexicon, v);
}
void sherpa_shim_tts_config_set_vits_data_dir(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.vits.data_dir, v);
}
void sherpa_shim_tts_config_set_vits_noise_scale(void *h, float v) {
((SherpaOnnxOfflineTtsConfig *)h)->model.vits.noise_scale = v;
}
void sherpa_shim_tts_config_set_vits_noise_scale_w(void *h, float v) {
((SherpaOnnxOfflineTtsConfig *)h)->model.vits.noise_scale_w = v;
}
void sherpa_shim_tts_config_set_vits_length_scale(void *h, float v) {
((SherpaOnnxOfflineTtsConfig *)h)->model.vits.length_scale = v;
}
void sherpa_shim_tts_config_set_num_threads(void *h, int32_t v) {
((SherpaOnnxOfflineTtsConfig *)h)->model.num_threads = v;
}
void sherpa_shim_tts_config_set_debug(void *h, int32_t v) {
((SherpaOnnxOfflineTtsConfig *)h)->model.debug = v;
}
void sherpa_shim_tts_config_set_provider(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineTtsConfig *)h)->model.provider, v);
}
void sherpa_shim_tts_config_set_max_num_sentences(void *h, int32_t v) {
((SherpaOnnxOfflineTtsConfig *)h)->max_num_sentences = v;
}
void *sherpa_shim_create_offline_tts(void *h) {
return (void *)SherpaOnnxCreateOfflineTts(
(const SherpaOnnxOfflineTtsConfig *)h);
}
// ==================================================================
// Offline recognizer config
// ==================================================================
void *sherpa_shim_offline_recog_config_new(void) {
return calloc(1, sizeof(SherpaOnnxOfflineRecognizerConfig));
}
void sherpa_shim_offline_recog_config_free(void *h) {
if (!h) return;
SherpaOnnxOfflineRecognizerConfig *c = (SherpaOnnxOfflineRecognizerConfig *)h;
free((char *)c->model_config.provider);
free((char *)c->model_config.tokens);
free((char *)c->model_config.whisper.encoder);
free((char *)c->model_config.whisper.decoder);
free((char *)c->model_config.whisper.language);
free((char *)c->model_config.whisper.task);
free((char *)c->model_config.paraformer.model);
free((char *)c->model_config.sense_voice.model);
free((char *)c->model_config.sense_voice.language);
free((char *)c->model_config.omnilingual.model);
free((char *)c->decoding_method);
free(c);
}
void sherpa_shim_offline_recog_config_set_num_threads(void *h, int32_t v) {
((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.num_threads = v;
}
void sherpa_shim_offline_recog_config_set_debug(void *h, int32_t v) {
((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.debug = v;
}
void sherpa_shim_offline_recog_config_set_provider(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.provider, v);
}
void sherpa_shim_offline_recog_config_set_tokens(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.tokens, v);
}
void sherpa_shim_offline_recog_config_set_feat_sample_rate(void *h, int32_t v) {
((SherpaOnnxOfflineRecognizerConfig *)h)->feat_config.sample_rate = v;
}
void sherpa_shim_offline_recog_config_set_feat_feature_dim(void *h, int32_t v) {
((SherpaOnnxOfflineRecognizerConfig *)h)->feat_config.feature_dim = v;
}
void sherpa_shim_offline_recog_config_set_decoding_method(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->decoding_method, v);
}
void sherpa_shim_offline_recog_config_set_whisper_encoder(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.encoder, v);
}
void sherpa_shim_offline_recog_config_set_whisper_decoder(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.decoder, v);
}
void sherpa_shim_offline_recog_config_set_whisper_language(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.language, v);
}
void sherpa_shim_offline_recog_config_set_whisper_task(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.task, v);
}
void sherpa_shim_offline_recog_config_set_whisper_tail_paddings(void *h, int32_t v) {
((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.whisper.tail_paddings = v;
}
void sherpa_shim_offline_recog_config_set_paraformer_model(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.paraformer.model, v);
}
void sherpa_shim_offline_recog_config_set_sense_voice_model(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.sense_voice.model, v);
}
void sherpa_shim_offline_recog_config_set_sense_voice_language(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.sense_voice.language, v);
}
void sherpa_shim_offline_recog_config_set_sense_voice_use_itn(void *h, int32_t v) {
((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.sense_voice.use_itn = v;
}
void sherpa_shim_offline_recog_config_set_omnilingual_model(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineRecognizerConfig *)h)->model_config.omnilingual.model, v);
}
void *sherpa_shim_create_offline_recognizer(void *h) {
return (void *)SherpaOnnxCreateOfflineRecognizer(
(const SherpaOnnxOfflineRecognizerConfig *)h);
}
// ==================================================================
// Online recognizer config
// ==================================================================
void *sherpa_shim_online_recog_config_new(void) {
return calloc(1, sizeof(SherpaOnnxOnlineRecognizerConfig));
}
void sherpa_shim_online_recog_config_free(void *h) {
if (!h) return;
SherpaOnnxOnlineRecognizerConfig *c = (SherpaOnnxOnlineRecognizerConfig *)h;
free((char *)c->model_config.transducer.encoder);
free((char *)c->model_config.transducer.decoder);
free((char *)c->model_config.transducer.joiner);
free((char *)c->model_config.tokens);
free((char *)c->model_config.provider);
free((char *)c->decoding_method);
free(c);
}
void sherpa_shim_online_recog_config_set_transducer_encoder(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.transducer.encoder, v);
}
void sherpa_shim_online_recog_config_set_transducer_decoder(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.transducer.decoder, v);
}
void sherpa_shim_online_recog_config_set_transducer_joiner(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.transducer.joiner, v);
}
void sherpa_shim_online_recog_config_set_tokens(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.tokens, v);
}
void sherpa_shim_online_recog_config_set_num_threads(void *h, int32_t v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.num_threads = v;
}
void sherpa_shim_online_recog_config_set_debug(void *h, int32_t v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.debug = v;
}
void sherpa_shim_online_recog_config_set_provider(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->model_config.provider, v);
}
void sherpa_shim_online_recog_config_set_feat_sample_rate(void *h, int32_t v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->feat_config.sample_rate = v;
}
void sherpa_shim_online_recog_config_set_feat_feature_dim(void *h, int32_t v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->feat_config.feature_dim = v;
}
void sherpa_shim_online_recog_config_set_decoding_method(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOnlineRecognizerConfig *)h)->decoding_method, v);
}
void sherpa_shim_online_recog_config_set_enable_endpoint(void *h, int32_t v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->enable_endpoint = v;
}
void sherpa_shim_online_recog_config_set_rule1_min_trailing_silence(void *h, float v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->rule1_min_trailing_silence = v;
}
void sherpa_shim_online_recog_config_set_rule2_min_trailing_silence(void *h, float v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->rule2_min_trailing_silence = v;
}
void sherpa_shim_online_recog_config_set_rule3_min_utterance_length(void *h, float v) {
((SherpaOnnxOnlineRecognizerConfig *)h)->rule3_min_utterance_length = v;
}
void *sherpa_shim_create_online_recognizer(void *h) {
return (void *)SherpaOnnxCreateOnlineRecognizer(
(const SherpaOnnxOnlineRecognizerConfig *)h);
}
// ==================================================================
// Result-struct accessors
// ==================================================================
int32_t sherpa_shim_wave_sample_rate(const void *h) {
return ((const SherpaOnnxWave *)h)->sample_rate;
}
int32_t sherpa_shim_wave_num_samples(const void *h) {
return ((const SherpaOnnxWave *)h)->num_samples;
}
const float *sherpa_shim_wave_samples(const void *h) {
return ((const SherpaOnnxWave *)h)->samples;
}
const char *sherpa_shim_offline_result_text(const void *h) {
return ((const SherpaOnnxOfflineRecognizerResult *)h)->text;
}
const char *sherpa_shim_online_result_text(const void *h) {
return ((const SherpaOnnxOnlineRecognizerResult *)h)->text;
}
int32_t sherpa_shim_generated_audio_sample_rate(const void *h) {
return ((const SherpaOnnxGeneratedAudio *)h)->sample_rate;
}
int32_t sherpa_shim_generated_audio_n(const void *h) {
return ((const SherpaOnnxGeneratedAudio *)h)->n;
}
const float *sherpa_shim_generated_audio_samples(const void *h) {
return ((const SherpaOnnxGeneratedAudio *)h)->samples;
}
int32_t sherpa_shim_speech_segment_start(const void *h) {
return ((const SherpaOnnxSpeechSegment *)h)->start;
}
int32_t sherpa_shim_speech_segment_n(const void *h) {
return ((const SherpaOnnxSpeechSegment *)h)->n;
}
// ==================================================================
// Offline speaker diarization config
// ==================================================================
void *sherpa_shim_diarize_config_new(void) {
return calloc(1, sizeof(SherpaOnnxOfflineSpeakerDiarizationConfig));
}
void sherpa_shim_diarize_config_free(void *h) {
if (!h) return;
SherpaOnnxOfflineSpeakerDiarizationConfig *c =
(SherpaOnnxOfflineSpeakerDiarizationConfig *)h;
free((char *)c->segmentation.pyannote.model);
free((char *)c->segmentation.provider);
free((char *)c->embedding.model);
free((char *)c->embedding.provider);
free(c);
}
void sherpa_shim_diarize_config_set_segmentation_model(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineSpeakerDiarizationConfig *)h)->segmentation.pyannote.model, v);
}
void sherpa_shim_diarize_config_set_segmentation_num_threads(void *h, int32_t v) {
((SherpaOnnxOfflineSpeakerDiarizationConfig *)h)->segmentation.num_threads = v;
}
void sherpa_shim_diarize_config_set_segmentation_provider(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineSpeakerDiarizationConfig *)h)->segmentation.provider, v);
}
void sherpa_shim_diarize_config_set_segmentation_debug(void *h, int32_t v) {
((SherpaOnnxOfflineSpeakerDiarizationConfig *)h)->segmentation.debug = v;
}
void sherpa_shim_diarize_config_set_embedding_model(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineSpeakerDiarizationConfig *)h)->embedding.model, v);
}
void sherpa_shim_diarize_config_set_embedding_num_threads(void *h, int32_t v) {
((SherpaOnnxOfflineSpeakerDiarizationConfig *)h)->embedding.num_threads = v;
}
void sherpa_shim_diarize_config_set_embedding_provider(void *h, const char *v) {
shim_set_str(&((SherpaOnnxOfflineSpeakerDiarizationConfig *)h)->embedding.provider, v);
}
void sherpa_shim_diarize_config_set_embedding_debug(void *h, int32_t v) {
((SherpaOnnxOfflineSpeakerDiarizationConfig *)h)->embedding.debug = v;
}
void sherpa_shim_diarize_config_set_clustering_num_clusters(void *h, int32_t v) {
((SherpaOnnxOfflineSpeakerDiarizationConfig *)h)->clustering.num_clusters = v;
}
void sherpa_shim_diarize_config_set_clustering_threshold(void *h, float v) {
((SherpaOnnxOfflineSpeakerDiarizationConfig *)h)->clustering.threshold = v;
}
void sherpa_shim_diarize_config_set_min_duration_on(void *h, float v) {
((SherpaOnnxOfflineSpeakerDiarizationConfig *)h)->min_duration_on = v;
}
void sherpa_shim_diarize_config_set_min_duration_off(void *h, float v) {
((SherpaOnnxOfflineSpeakerDiarizationConfig *)h)->min_duration_off = v;
}
void *sherpa_shim_create_offline_speaker_diarization(void *h) {
return (void *)SherpaOnnxCreateOfflineSpeakerDiarization(
(const SherpaOnnxOfflineSpeakerDiarizationConfig *)h);
}
void sherpa_shim_diarize_set_clustering(void *sd, int32_t num_clusters, float threshold) {
if (!sd) return;
SherpaOnnxOfflineSpeakerDiarizationConfig cfg;
memset(&cfg, 0, sizeof(cfg));
cfg.clustering.num_clusters = num_clusters;
cfg.clustering.threshold = threshold;
SherpaOnnxOfflineSpeakerDiarizationSetConfig(
(const SherpaOnnxOfflineSpeakerDiarization *)sd, &cfg);
}
void sherpa_shim_diarize_segment_at(const void *segs, int32_t i,
float *out_start, float *out_end,
int32_t *out_speaker) {
const SherpaOnnxOfflineSpeakerDiarizationSegment *arr =
(const SherpaOnnxOfflineSpeakerDiarizationSegment *)segs;
if (out_start) *out_start = arr[i].start;
if (out_end) *out_end = arr[i].end;
if (out_speaker) *out_speaker = arr[i].speaker;
}
// ==================================================================
// TTS streaming callback trampoline
// ==================================================================
void *sherpa_shim_tts_generate_with_callback(
void *tts, const char *text, int32_t sid, float speed,
uintptr_t callback_ptr, uintptr_t user_data) {
SherpaOnnxGeneratedAudioCallbackWithArg cb =
(SherpaOnnxGeneratedAudioCallbackWithArg)callback_ptr;
return (void *)SherpaOnnxOfflineTtsGenerateWithCallbackWithArg(
(const SherpaOnnxOfflineTts *)tts, text, sid, speed, cb,
(void *)user_data);
}

View File

@@ -0,0 +1,164 @@
#ifndef LOCALAI_SHERPA_ONNX_SHIM_H
#define LOCALAI_SHERPA_ONNX_SHIM_H
#include <stdint.h>
// libsherpa-shim: purego-friendly wrapper around sherpa-onnx's C API.
// Purego can't access C struct fields and can't route C callbacks to Go
// funcs directly. Every function here is a fixed-signature trampoline
// that replaces one field read/write or callback handoff that the Go
// backend would otherwise have to do through cgo.
//
// String lifetime: setters strdup; _free walks every owned string and
// frees it. Callers may discard their input buffers the moment a setter
// returns.
//
// Opaque handles are `void *` in both directions. Nothing here holds a
// reference across calls except config handles (freed via _free) and
// sherpa-allocated results (freed via sherpa's own Destroy* entry
// points, which Go calls through purego pass-through).
#ifdef __cplusplus
extern "C" {
#endif
// --- VAD config -----------------------------------------------------
void *sherpa_shim_vad_config_new(void);
void sherpa_shim_vad_config_free(void *cfg);
void sherpa_shim_vad_config_set_silero_model(void *cfg, const char *path);
void sherpa_shim_vad_config_set_silero_threshold(void *cfg, float v);
void sherpa_shim_vad_config_set_silero_min_silence_duration(void *cfg, float v);
void sherpa_shim_vad_config_set_silero_min_speech_duration(void *cfg, float v);
void sherpa_shim_vad_config_set_silero_window_size(void *cfg, int32_t v);
void sherpa_shim_vad_config_set_silero_max_speech_duration(void *cfg, float v);
void sherpa_shim_vad_config_set_sample_rate(void *cfg, int32_t v);
void sherpa_shim_vad_config_set_num_threads(void *cfg, int32_t v);
void sherpa_shim_vad_config_set_provider(void *cfg, const char *v);
void sherpa_shim_vad_config_set_debug(void *cfg, int32_t v);
void *sherpa_shim_create_vad(void *cfg, float buffer_size_seconds);
// --- Offline TTS config (VITS path — the only TTS family the backend uses) ---
void *sherpa_shim_tts_config_new(void);
void sherpa_shim_tts_config_free(void *cfg);
void sherpa_shim_tts_config_set_vits_model(void *cfg, const char *v);
void sherpa_shim_tts_config_set_vits_tokens(void *cfg, const char *v);
void sherpa_shim_tts_config_set_vits_lexicon(void *cfg, const char *v);
void sherpa_shim_tts_config_set_vits_data_dir(void *cfg, const char *v);
void sherpa_shim_tts_config_set_vits_noise_scale(void *cfg, float v);
void sherpa_shim_tts_config_set_vits_noise_scale_w(void *cfg, float v);
void sherpa_shim_tts_config_set_vits_length_scale(void *cfg, float v);
void sherpa_shim_tts_config_set_num_threads(void *cfg, int32_t v);
void sherpa_shim_tts_config_set_debug(void *cfg, int32_t v);
void sherpa_shim_tts_config_set_provider(void *cfg, const char *v);
void sherpa_shim_tts_config_set_max_num_sentences(void *cfg, int32_t v);
void *sherpa_shim_create_offline_tts(void *cfg);
// --- Offline recognizer config (Whisper / Paraformer / SenseVoice / Omnilingual) ---
void *sherpa_shim_offline_recog_config_new(void);
void sherpa_shim_offline_recog_config_free(void *cfg);
void sherpa_shim_offline_recog_config_set_num_threads(void *cfg, int32_t v);
void sherpa_shim_offline_recog_config_set_debug(void *cfg, int32_t v);
void sherpa_shim_offline_recog_config_set_provider(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_tokens(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_feat_sample_rate(void *cfg, int32_t v);
void sherpa_shim_offline_recog_config_set_feat_feature_dim(void *cfg, int32_t v);
void sherpa_shim_offline_recog_config_set_decoding_method(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_whisper_encoder(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_whisper_decoder(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_whisper_language(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_whisper_task(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_whisper_tail_paddings(void *cfg, int32_t v);
void sherpa_shim_offline_recog_config_set_paraformer_model(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_sense_voice_model(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_sense_voice_language(void *cfg, const char *v);
void sherpa_shim_offline_recog_config_set_sense_voice_use_itn(void *cfg, int32_t v);
void sherpa_shim_offline_recog_config_set_omnilingual_model(void *cfg, const char *v);
void *sherpa_shim_create_offline_recognizer(void *cfg);
// --- Online recognizer config (streaming zipformer transducer) ---
void *sherpa_shim_online_recog_config_new(void);
void sherpa_shim_online_recog_config_free(void *cfg);
void sherpa_shim_online_recog_config_set_transducer_encoder(void *cfg, const char *v);
void sherpa_shim_online_recog_config_set_transducer_decoder(void *cfg, const char *v);
void sherpa_shim_online_recog_config_set_transducer_joiner(void *cfg, const char *v);
void sherpa_shim_online_recog_config_set_tokens(void *cfg, const char *v);
void sherpa_shim_online_recog_config_set_num_threads(void *cfg, int32_t v);
void sherpa_shim_online_recog_config_set_debug(void *cfg, int32_t v);
void sherpa_shim_online_recog_config_set_provider(void *cfg, const char *v);
void sherpa_shim_online_recog_config_set_feat_sample_rate(void *cfg, int32_t v);
void sherpa_shim_online_recog_config_set_feat_feature_dim(void *cfg, int32_t v);
void sherpa_shim_online_recog_config_set_decoding_method(void *cfg, const char *v);
void sherpa_shim_online_recog_config_set_enable_endpoint(void *cfg, int32_t v);
void sherpa_shim_online_recog_config_set_rule1_min_trailing_silence(void *cfg, float v);
void sherpa_shim_online_recog_config_set_rule2_min_trailing_silence(void *cfg, float v);
void sherpa_shim_online_recog_config_set_rule3_min_utterance_length(void *cfg, float v);
void *sherpa_shim_create_online_recognizer(void *cfg);
// --- Result accessors (sherpa-allocated; caller destroys via sherpa's own Destroy*) ---
int32_t sherpa_shim_wave_sample_rate(const void *wave);
int32_t sherpa_shim_wave_num_samples(const void *wave);
const float *sherpa_shim_wave_samples(const void *wave);
const char *sherpa_shim_offline_result_text(const void *result);
const char *sherpa_shim_online_result_text(const void *result);
int32_t sherpa_shim_generated_audio_sample_rate(const void *audio);
int32_t sherpa_shim_generated_audio_n(const void *audio);
const float *sherpa_shim_generated_audio_samples(const void *audio);
int32_t sherpa_shim_speech_segment_start(const void *seg);
int32_t sherpa_shim_speech_segment_n(const void *seg);
// --- Offline speaker diarization config -----------------------------
// Pyannote segmentation + speaker-embedding extractor + fast clustering.
// The upstream config is a struct of nested structs; purego can't read or
// build those across dlopen, so we expose a calloc'd opaque holder plus
// flat setters, then hand it to sherpa via the create wrapper.
void *sherpa_shim_diarize_config_new(void);
void sherpa_shim_diarize_config_free(void *cfg);
void sherpa_shim_diarize_config_set_segmentation_model(void *cfg, const char *path);
void sherpa_shim_diarize_config_set_segmentation_num_threads(void *cfg, int32_t v);
void sherpa_shim_diarize_config_set_segmentation_provider(void *cfg, const char *v);
void sherpa_shim_diarize_config_set_segmentation_debug(void *cfg, int32_t v);
void sherpa_shim_diarize_config_set_embedding_model(void *cfg, const char *path);
void sherpa_shim_diarize_config_set_embedding_num_threads(void *cfg, int32_t v);
void sherpa_shim_diarize_config_set_embedding_provider(void *cfg, const char *v);
void sherpa_shim_diarize_config_set_embedding_debug(void *cfg, int32_t v);
void sherpa_shim_diarize_config_set_clustering_num_clusters(void *cfg, int32_t v);
void sherpa_shim_diarize_config_set_clustering_threshold(void *cfg, float v);
void sherpa_shim_diarize_config_set_min_duration_on(void *cfg, float v);
void sherpa_shim_diarize_config_set_min_duration_off(void *cfg, float v);
void *sherpa_shim_create_offline_speaker_diarization(void *cfg);
// Apply just the clustering knobs onto a loaded diarizer (sherpa
// supports re-clustering after Create), so per-call overrides like
// num_speakers don't require re-loading the heavy ONNX models.
void sherpa_shim_diarize_set_clustering(void *sd, int32_t num_clusters, float threshold);
// Sherpa's ResultSortByStartTime returns a sherpa-allocated array of
// SherpaOnnxOfflineSpeakerDiarizationSegment structs (free with
// SherpaOnnxOfflineSpeakerDiarizationDestroySegment). Purego can't read
// fields out of an array of C structs, so this getter copies one
// segment's fields into the caller-supplied float/int32 cells.
void sherpa_shim_diarize_segment_at(const void *segs, int32_t i,
float *out_start, float *out_end,
int32_t *out_speaker);
// --- TTS streaming callback trampoline -----------------------------
// Replaces the //export sherpaTtsGoCallback + callbacks.c bridge pattern.
// `callback_ptr` is the C-callable function pointer returned by
// purego.NewCallback. `user_data` is an integer the Go side uses to
// look up its state (sync.Map keyed by uint64).
//
// Returns the sherpa-allocated SherpaOnnxGeneratedAudio. Destroy with
// SherpaOnnxDestroyOfflineTtsGeneratedAudio (callable directly from
// Go via purego).
void *sherpa_shim_tts_generate_with_callback(
void *tts, const char *text, int32_t sid, float speed,
uintptr_t callback_ptr, uintptr_t user_data);
#ifdef __cplusplus
}
#endif
#endif

View File

@@ -0,0 +1,23 @@
package main
import (
"flag"
grpc "github.com/mudler/LocalAI/pkg/grpc"
)
var (
addr = flag.String("addr", "localhost:50051", "the address to connect to")
)
func main() {
flag.Parse()
if err := loadSherpaLibs(); err != nil {
panic(err)
}
if err := grpc.StartServer(*addr, &SherpaBackend{}); err != nil {
panic(err)
}
}

Some files were not shown because too many files have changed in this diff Show More