Commit Graph

6581 Commits

Author SHA1 Message Date
Ettore Di Giacinto
9f41e69bc3 fix(distributed): self-heal stale 'model not loaded' routing
In distributed mode the registry can list a model as loaded on a node
while the worker has evicted it (autonomous LRU eviction, an out-of-band
unload, etc.) yet the backend process survives. The router's cached-node
check only verifies the process is alive (probeHealth), so it routes there
and inference fails with "<backend>: model not loaded" — and stays broken
until the controller restarts and rebuilds its registry.

InFlightTrackingClient now reconciles this: when a tracked inference call
returns a model-not-loaded error, it drops the stale replica row
(RemoveNodeModel) so the next request reloads the model on a healthy node
instead of routing back to the evicted one. The original error is returned
unchanged; only the registry is corrected.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-04 23:00:50 +00:00
Adira
ef80a0e825 fix(config): add face/speaker recognition constants and register insightface + speaker-recognition (#10110)
FLAG_FACE_RECOGNITION and FLAG_SPEAKER_RECOGNITION already existed as
ModelConfigUsecase bitmask flags, and GuessUsecases already gate-checks
both backends by name — but BackendCapabilities had no entries for
either, so the UI could not classify them.

Also missing were the Method* constants for the five proto-defined RPCs
these backends implement (FaceVerify, FaceAnalyze, VoiceVerify,
VoiceEmbed, VoiceAnalyze) and the corresponding Usecase* strings
and UsecaseInfoMap entries needed to wire them into the rest of the
capability system.

Changes:
- Add MethodFaceVerify, MethodFaceAnalyze, MethodVoiceVerify,
  MethodVoiceEmbed, MethodVoiceAnalyze GRPCMethod constants
- Add UsecaseFaceRecognition ("face_recognition") and
  UsecaseSpeakerRecognition ("speaker_recognition") Usecase constants
- Add UsecaseInfoMap entries for both new usecases, referencing the
  existing FLAG_FACE_RECOGNITION and FLAG_SPEAKER_RECOGNITION flags
- Register insightface: Embedding + Detect + FaceVerify + FaceAnalyze
- Register speaker-recognition: VoiceVerify + VoiceEmbed + VoiceAnalyze

Follows up on #10107 which left these two out because they needed new
constants first.

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
2026-06-04 21:48:01 +02:00
LocalAI [bot]
92726f7631 fix(distributed): stage directory-based models to remote nodes (#10175)
Distributed file-staging treated every model path field (ModelFile, etc.)
as a single regular file: it os.Open'd the path and streamed its fd as the
HTTP PUT body. For directory-based models — e.g. qwen3-tts-cpp, whose
weights and tokenizer ggufs live under one directory referenced by
parameters.model — opening the directory succeeds but reading its fd
returns EISDIR, so routing the model to a remote NATS worker failed with
"read /models/<model>: is a directory". Single-file models were unaffected,
so only multi-file pipelines (e.g. the realtime TTS stage) broke.

stageModelFiles now detects a directory path field and stages each
contained file individually (via the new stageDirectory helper), preserving
structure with the existing StagingKeyMapper and rewriting the field to the
remote directory (deriving ModelPath as before). countStageableFiles makes
the progress total count a directory's files so the staging tracker stays
accurate.

Assisted-by: Claude:claude-opus-4-8 go vet

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-04 18:05:38 +02:00
LocalAI [bot]
994063ba9a feat(qwen3-tts-cpp): normalize request language for flexible matching (#10174)
The qwen3-tts.cpp backend honored the request `language` field only via exact lowercase two-letter codes in the C++ language_to_id table, silently defaulting to English for anything else (en-US, EN, english, ...).

Add normalizeLanguage() in the Go handler: lowercase + trim, strip the region/locale suffix (en-US, pt_BR, zh-Hans -> en/pt/zh), and resolve common English full names (english -> en). The canonical codes match the existing C++ table, so no C++ change is needed. Covered by a pure-Go Ginkgo spec. Also document the language field and accepted forms under the Qwen3-TTS docs.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-04 17:26:31 +02:00
LocalAI [bot]
c1a55cf72d chore: ⬆️ Update mudler/parakeet.cpp to b11fe5bca78ad8b342dd559a43d76df3984bb447 (#10167)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-04 12:07:09 +02:00
LocalAI [bot]
96758841d8 chore: ⬆️ Update predict-woo/qwen3-tts.cpp to 136e5d36c17083da0321fd96512dc7b263f94a44 (#10165)
⬆️ Update predict-woo/qwen3-tts.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-04 12:06:55 +02:00
LocalAI [bot]
7a59260621 chore: ⬆️ Update CrispStrobe/CrispASR to 13d54e110e1538e0f0bc3af0680b9ab246cfb48d (#10145)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-04 12:06:32 +02:00
LocalAI [bot]
27e63b9a78 feat(tts): support per-request instructions and params (#10172)
The OpenAI-compatible TTS endpoint accepts an `instructions` field, but it
was silently dropped at the HTTP->gRPC boundary: neither schema.TTSRequest
nor the gRPC TTSRequest proto carried it, so backends could only read such a
value from static YAML options (identical for every request). This blocked
per-line emotion/style and, for Qwen3-TTS VoiceDesign, limited a model config
to a single designed voice.

Plumb a generic per-request instruction string end to end, plus an optional
backend-specific params map:

- proto: add `optional string instructions` and `map<string,string> params`
  to TTSRequest.
- schema: add Instructions (maps OpenAI `instructions`) and Params (LocalAI
  extension) to schema.TTSRequest.
- core: thread both through ModelTTS/ModelTTSStream via a newTTSRequest helper
  that attaches instructions only when non-empty (so backends can fall back to
  YAML when unset); forward them from the /v1/audio/speech handler.
- qwen-tts: prefer the per-request instruction over the YAML `instruct` option
  (used by both mode detection and generation) and merge per-request params.
- chatterbox: merge per-request params (coerced to float/int/bool) over YAML
  options into generate() kwargs.

Fully backward compatible: empty instructions fall back to the YAML option and
backends that don't support style/voice instructions ignore the field.

Closes #10164


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-04 11:45:02 +02:00
LocalAI [bot]
55c0911c23 chore: ⬆️ Update leejet/stable-diffusion.cpp to 1f9ee88e09c258053fa59d5e05e23dfb10fa0b13 (#10166)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-04 09:34:34 +02:00
LocalAI [bot]
f6cb6ab6d9 chore: ⬆️ Update ggml-org/llama.cpp to 94a220cd6745e6e3f8de62870b66fd5b9bc92700 (#10168)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-04 09:34:13 +02:00
LocalAI [bot]
9f11b09c6a chore(model-gallery): ⬆️ update checksum (#10169)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-04 00:32:15 +02:00
LocalAI [bot]
a5c4f822f0 chore: ⬆️ Update antirez/ds4 to 477c0e82e2699b35a65fd0a1ed6fe66b41087dfe (#10142)
⬆️ Update antirez/ds4

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-03 19:45:23 +02:00
LocalAI [bot]
fb36c262fe chore(model gallery): 🤖 add 1 new models via gallery agent (#10163)
chore(model gallery): 🤖 add new models via gallery agent

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-03 19:44:51 +02:00
LocalAI [bot]
0e4e8980e6 chore: ⬆️ Update ggml-org/llama.cpp to 5c394fdc8b564eff6faacc50a139529d875f0e36 (#10143)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-03 19:44:21 +02:00
Richard Palethorpe
3a932a9803 feat(distributed): Add NATS JWT authentication and TLS/mTLS options (#10159)
* feat(distributed): NATS JWT auth, TLS/mTLS options, and e2e coverage

Mint per-node NATS user JWTs at registration when LOCALAI_NATS_ACCOUNT_SEED
is set, and connect workers with scoped credentials from the register response.
Add optional LOCALAI_NATS_TLS_CA/CERT/KEY for private CA and mTLS alongside
tls:// URLs, plus test-e2e-distributed and NatsJWT container e2e specs.

Document JWT setup (nats-auth-setup.sh) and TLS env vars in distributed-mode.

Assisted-by: Grok:grok grok-build
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(distributed): correct NATS JWT scoping and harden client auth

The JWT-auth path added in 46467cc7 had several gaps that fail silently
under LOCALAI_NATS_REQUIRE_AUTH:

- Agent-worker minted JWTs did not allow the subjects the agent worker
  actually subscribes to (jobs.mcp-ci.new and nodes.<id>.backend.stop),
  so MCP-CI jobs and backend-stop session cleanup were silently dropped.
  Scope the agent permission set to those subjects.
- NATS subscription permission violations were swallowed (Subscribe
  returned a live-but-dead subscription). Confirm subscriptions with a
  server round-trip so a denial surfaces synchronously, and log async
  permission errors.
- The backend worker connected anonymously when given a JWT without its
  paired seed; reject the unpaired credential instead.
- The documented service-user permissions in nats-auth-setup.sh omitted
  prefixcache.>, which the frontend publishes and subscribes; add it.

Also: add a credential-provider hook to the messaging client (consumed by
the follow-up credential-lifecycle change), drop the always-nil error from
NatsMessagingOptions, run go mod tidy (jwt/v2 and nkeys are now direct),
and gofmt the feature's files.

Tests: an agent-JWT e2e spec that connects to the enforcing NATS server
and exercises every subscription the agent worker makes, plus permission
allow-list coverage unit tests.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(distributed): acquire and auto-refresh worker NATS credentials

Workers fetched NATS credentials once at startup, which broke two cases
under JWT auth: a worker that registered while still pending admin
approval never received a minted JWT (it connected unauthenticated and
gave up), and a long-running worker's 24h JWT expired with no way to renew
it.

Introduce workerregistry.NATSCredentialManager, built on idempotent
re-registration (the frontend preserves the node row and mints a fresh JWT
each call):

- Acquire re-registers through admin approval until the node is approved
  and credentials are minted (or returns the first success when auth is
  not required, preserving anonymous-NATS behavior).
- RefreshLoop re-registers before the JWT expires (~75% of its lifetime),
  updating the credentials served to the connection.
- Both are bounded (default 100 attempts / consecutive failures) and
  return an error on exhaustion, so an unapprovable or unrenewable worker
  exits non-zero and surfaces the problem instead of hanging or drifting
  toward an expired credential.

The messaging client gains WithUserJWTProvider, fetching credentials on
each (re)connect so the connection transparently adopts a refreshed JWT
when the server expires the old one. RegisterFull exposes the approval
status and full response; Register delegates to it.

Both the backend worker and the agent worker are wired to this: explicit
env credentials are used as-is, minted credentials are acquired-with-wait
and refreshed, and a permanent refresh failure shuts the worker down so it
restarts and re-acquires.

Tests cover Acquire (wait-through-pending, bounded give-up, context
cancel), RefreshLoop (refresh-before-expiry, bounded failure, no-expiry
exit) and jwtExpiry decoding. Docs updated in distributed-mode.md.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-06-03 19:43:56 +02:00
LocalAI [bot]
9d10418593 fix(parakeet-cpp): convert audio before the non-batched transcribe path (#10161)
The direct (non-batched) transcription path handed the original upload
path straight to the C library via parakeet_capi_transcribe_path_json.
That loader only understands 16 kHz mono WAV/PCM, so any other format
(MP3, etc.) failed with "parakeet: failed to load audio: <file>".

Only the batched path converted the input (via decodeWavMono16k ->
utils.AudioToWav). Every other audio backend (whisper, crispasr)
converts unconditionally with utils.AudioToWav before handing the file
to its engine; the parakeet-cpp fallback was the lone exception.

Extract a convertToWavMono16k helper (reused by decodeWavMono16k) that
produces a 16 kHz mono WAV in a temp dir, and run the non-batched path
through it before calling the C loader. WAV inputs already in the target
format are passed through without ffmpeg.

Add specs covering the helper (decodable copy + cleanup, and an error on
a missing input) that need neither the model, the C library, nor ffmpeg.


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-03 15:06:57 +02:00
dependabot[bot]
5470051d4d chore(deps): bump grpcio from 1.80.0 to 1.81.0 in /backend/python/transformers (#10158)
chore(deps): bump grpcio in /backend/python/transformers

Bumps [grpcio](https://github.com/grpc/grpc) from 1.80.0 to 1.81.0.
- [Release notes](https://github.com/grpc/grpc/releases)
- [Commits](https://github.com/grpc/grpc/compare/v1.80.0...v1.81.0)

---
updated-dependencies:
- dependency-name: grpcio
  dependency-version: 1.81.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-03 10:38:43 +02:00
LocalAI [bot]
68c5eeebc3 chore: ⬆️ Update ggml-org/whisper.cpp to 610e664ba7cfe3af46125ed1b5a1184fccb51bcd (#10140)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-03 10:38:28 +02:00
dependabot[bot]
1531fabe23 chore(deps): bump securego/gosec from 2.22.9 to 2.27.1 (#10147)
Bumps [securego/gosec](https://github.com/securego/gosec) from 2.22.9 to 2.27.1.
- [Release notes](https://github.com/securego/gosec/releases)
- [Commits](https://github.com/securego/gosec/compare/v2.22.9...v2.27.1)

---
updated-dependencies:
- dependency-name: securego/gosec
  dependency-version: 2.27.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-03 10:38:07 +02:00
LocalAI [bot]
b7673d5b76 chore: ⬆️ Update leejet/stable-diffusion.cpp to 2d40a8b2adcdf8b5b0ca0535f3bb7801b6ba13e5 (#10144)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-03 10:37:51 +02:00
dependabot[bot]
b64bdaf406 chore(deps): bump github.com/google/go-containerregistry from 0.21.5 to 0.21.6 (#10149)
chore(deps): bump github.com/google/go-containerregistry

Bumps [github.com/google/go-containerregistry](https://github.com/google/go-containerregistry) from 0.21.5 to 0.21.6.
- [Release notes](https://github.com/google/go-containerregistry/releases)
- [Commits](https://github.com/google/go-containerregistry/compare/v0.21.5...v0.21.6)

---
updated-dependencies:
- dependency-name: github.com/google/go-containerregistry
  dependency-version: 0.21.6
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-03 10:37:33 +02:00
dependabot[bot]
eebf08ff1d chore(deps): bump grpcio from 1.80.0 to 1.81.0 in /backend/python/vllm (#10157)
Bumps [grpcio](https://github.com/grpc/grpc) from 1.80.0 to 1.81.0.
- [Release notes](https://github.com/grpc/grpc/releases)
- [Commits](https://github.com/grpc/grpc/compare/v1.80.0...v1.81.0)

---
updated-dependencies:
- dependency-name: grpcio
  dependency-version: 1.81.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-03 10:37:16 +02:00
dependabot[bot]
42e51894c3 chore(deps): bump go.opentelemetry.io/otel/exporters/prometheus from 0.65.0 to 0.66.0 (#10151)
chore(deps): bump go.opentelemetry.io/otel/exporters/prometheus

Bumps [go.opentelemetry.io/otel/exporters/prometheus](https://github.com/open-telemetry/opentelemetry-go) from 0.65.0 to 0.66.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-go/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md)
- [Commits](https://github.com/open-telemetry/opentelemetry-go/compare/exporters/prometheus/v0.65.0...metric/x/v0.66.0)

---
updated-dependencies:
- dependency-name: go.opentelemetry.io/otel/exporters/prometheus
  dependency-version: 0.66.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-03 09:14:42 +02:00
LocalAI [bot]
d9ae6481fb chore: ⬆️ Update mudler/parakeet.cpp to 9edf17c3ada66e0f881dcff155492867db7ac4cf (#10141)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-03 08:49:47 +02:00
dependabot[bot]
f1c495a748 chore(deps): bump github.com/mudler/edgevpn from 0.32.2 to 0.34.0 (#10153)
Bumps [github.com/mudler/edgevpn](https://github.com/mudler/edgevpn) from 0.32.2 to 0.34.0.
- [Release notes](https://github.com/mudler/edgevpn/releases)
- [Commits](https://github.com/mudler/edgevpn/compare/v0.32.2...v0.34.0)

---
updated-dependencies:
- dependency-name: github.com/mudler/edgevpn
  dependency-version: 0.34.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-03 08:34:16 +02:00
LocalAI [bot]
415b561947 docs: fix distributed-mode diagram (workers use NATS, not PostgreSQL) (#10138)
docs: fix distributed-mode diagram - workers coordinate via NATS, not PostgreSQL

The architecture diagram drew the worker-bound arrows from the PostgreSQL area of the control plane, implying workers connect to PostgreSQL. They do not: PostgreSQL is the frontends shared state, while workers coordinate over NATS (backend.install events) and receive LoadModel over gRPC from a frontend. Re-route the worker arrows to originate from the NATS chip.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-02 22:05:33 +02:00
Ettore Di Giacinto
e6a0d4c375 Remove diagram from distributed mode documentation
Removed ASCII diagram of distributed mode architecture from the documentation.

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-06-02 18:48:12 +02:00
LocalAI [bot]
7e59a5c7c5 docs: architecture & feature diagrams (blueprint style) (#10137)
* docs: add 'how LocalAI works' architecture diagram

Add a blueprint-style architecture diagram: clients -> small core (API,
router, WebUI, agents) -> gRPC -> backend processes pulled on demand as
OCI images. Place it on the overview page and replace the stale external
architecture image on the reference page.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs: add blueprint diagrams across feature, distributed & getting-started docs

Add 24 architecture/flow/comparison diagrams (PNG + HTML source) under
docs/static/images/diagrams/, wired into their docs pages, from an
impact-vs-effort audit of the docs. Broaden the API surface on the
overview architecture diagram (OpenAI, Anthropic, ElevenLabs, Ollama,
and LocalAI's own API) and move the gRPC boundary label clear of the arrows.

Pages: distributed mode (architecture, scheduling, ds4 layer-split),
distributed inferencing, MLX, realtime, quantization, MCP, agents,
mitm & cloud proxy, middleware, reverse-proxy TLS, VRAM, voice & face
recognition, reranker, function calling, fine-tuning (recipe + jobs),
diarization, audio transform, quickstart, model resolution.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs: add composable-core diagram to README hero

Commit the composable-core card (small core + on-demand backend tiles)
alongside the other diagrams and reference it from the README hero via a
repo-relative path, so it renders on GitHub.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs: fix composable-core connectors/badge and federated-vs-worker layout

- composable-core: thicken the plug-in connectors so they read clearly, and
  widen the SEPARATE IMAGE badge so its text no longer overflows the box.
- federated-vs-worker: shorten the WHOLE/SPLIT REQUEST pills to fit, and
  replace the tangled node-to-node activation arrows with a clean fan-out
  (request split across all sharded nodes), mirroring the federated panel.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-02 18:43:22 +02:00
LocalAI [bot]
aea954a482 docs: position LocalAI as a composable engine, not a bundle (#10136)
Reframe the README hero and docs (homepage, overview, FAQ) around the
composable architecture: a small core, with backends built as dedicated
gRPC services around best-in-class engines, shipped as separate OCI
images and pulled on demand. Lead from strength: drop the "36+ backends"
kitchen-sink framing and the "All-in-One Complete AI Stack" / "single
binary that gives you everything" lines that read as a monolith.

- README: small-core differentiator; composable + open/extensible bullets
- _index.md: composable tagline; install only what you use
- overview.md: core vs on-demand backends; gRPC/OCI mechanics as benefits;
  bring-your-own model and backend
- faq.md: "Do I need to install all the backends?" and
  "Can I bring my own model or backend?"

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-02 17:34:43 +02:00
Ettore Di Giacinto
595e448714 docs(llama.cpp): note tensor split now works with quantized KV cache (#10135)
The split_mode: tensor description claimed tensor parallelism requires
KV-cache quantization to be disabled. ggml-org/llama.cpp#23792 lifts that
restriction by extending the meta backend to preserve shape information
through KV-cache flatten/reshape, so cache_type_k/cache_type_v
quantization can be combined with -sm tensor on builds that include it.

Documentation only: no backend code, grpc-server.cpp comment, or
llama.cpp pin changes.


Assisted-by: Claude Code:claude-opus-4-8

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-02 15:52:23 +02:00
LocalAI [bot]
860f9d63ad feat(parakeet-cpp): dynamic batching for concurrent transcription requests (#10112)
* feat(parakeet-cpp): dynamic-batching scheduler (queue + dispatcher)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(parakeet-cpp): dynamic batching for AudioTranscription via batched JSON C-API

Drop SingleThread; route unary transcription through the in-process batcher
which coalesces concurrent requests into one batched engine call. Streaming
stays mutually exclusive via engineMu. Adds batch_max_size / batch_max_wait_ms
options (size=1 disables; recommended on CPU).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(parakeet-cpp): tear down dispatcher in Free; log batch config; preallocate; clarify stream lock

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(parakeet-cpp): Ginkgo batcher tests; optional batch C-API binding with per-request fallback

The batched JSON C-API symbol exists only in newer libparakeet.so (ABI >= 2);
probe it with Dlsym and register optionally so the backend still loads against
an older library, falling back to per-request transcription. Rewrites the
batcher unit tests as Ginkgo/Gomega specs (forbidigo bans t.Fatal in tests).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(parakeet-cpp): debug-log coalesced batch size in runBatch

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(parakeet-cpp): default batch_max_size to 1 (batching opt-in)

Dynamic batching now defaults off (batch_max_size:1, one request at a
time). Raise batch_max_size to opt in: it is a large throughput win on
GPU under concurrent load, but on CPU and low-concurrency setups it only
adds latency, so off is the safer default. The startup log now states
whether batching is on or off, and the audio-to-text docs are updated to
match.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* chore(parakeet-cpp): bump parakeet.cpp to 8a7c482 (batched decode + B=1 fast-path)

parakeet.cpp PR #1 merged the batched encoder/decode and the B=1 encoder
fast-path to master. Point PARAKEET_VERSION at that commit so the backend
builds the batched C-API (parakeet_capi_transcribe_pcm_batch_json) that the
dynamic batcher calls; the prior pin (30a3075) predated it, so only the
per-request fallback path was exercised. Verified the shared lib builds with
the backend's CMake flags and exports the batch symbol.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-02 14:49:02 +02:00
LocalAI [bot]
a5a0b3dc4e chore: ⬆️ Update CrispStrobe/CrispASR to 05e60432bcb5bc2113f8c395a41e86497c11504a (#10115)
⬆️ Update CrispStrobe/CrispASR

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-02 14:48:47 +02:00
番茄摔成番茄酱
94eca04c60 fix(nemo): pin texterrors to 1.1.6 for GLIBCXX compatibility (#10134)
Pin texterrors==1.1.6 before nemo_toolkit[asr] in requirements-cublas13.txt.

The texterrors package (a NeMo transitive dependency) contains a compiled
C++ extension (texterrors_align.so) that may be built from source during
OCI image creation. When built on systems with GCC 14+ (e.g. Ubuntu 24.04),
the resulting binary requires GLIBCXX_3.4.32, which is not available in
the default LocalAI container (Ubuntu 22.04, GLIBCXX up to 3.4.30).

Pinning to 1.1.6 (the latest release) ensures:
- Reproducible builds across environments
- pip resolves the pre-built manylinux2014 wheel (needs only GLIBCXX_3.4.11)
  instead of potentially building from source with a newer toolchain

Fixes #10056

Signed-off-by: 番茄摔成番茄酱 <fqscfqj@outlook.com>
2026-06-02 14:48:27 +02:00
LocalAI [bot]
35bd485d6a chore: ⬆️ Update ggml-org/llama.cpp to 5dcb71166686799f0d873eab7386234302d05ecf (#10128)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-02 09:06:35 +02:00
LocalAI [bot]
1fe96f8d9a chore: ⬆️ Update mudler/parakeet.cpp to 8a7c48209d7882a7ce79a6b306270e4703194543 (#10129)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-02 09:06:19 +02:00
LocalAI [bot]
c508e9d7c6 chore: ⬆️ Update leejet/stable-diffusion.cpp to 7948df8ac1070f5f6881b8d34675821893eb97d6 (#10127)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-02 09:06:03 +02:00
LocalAI [bot]
55e754fd05 chore: ⬆️ Update ggml-org/whisper.cpp to 23ee03506a91ac3d3f0071b40e66a430eebdfa1d (#10130)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-02 01:43:03 +02:00
LocalAI [bot]
a17753f7d1 chore(model-gallery): ⬆️ update checksum (#10131)
⬆️ Checksum updates in gallery/index.yaml

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-01 23:39:47 +02:00
Zhao73
c61838dba6 docs: fix documentation typos (#10125)
Correct clear spelling mistakes in documentation without changing behavior.

Confidence: high
Scope-risk: narrow
Tested: git diff --check; uvx codespell on changed files
Not-tested: Full docs build not run; text-only changes
Assisted-by: Codex:gpt-5 codespell
2026-06-01 14:31:08 +02:00
LocalAI [bot]
7013e13f05 chore: ⬆️ Update ggml-org/llama.cpp to 399739d5c5978351f39e3454bfbfbab4f369088f (#10119)
⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-01 14:24:51 +02:00
Richard Palethorpe
5a0013defe test(react-ui): add page render-smoke specs, reset the coverage gate (#10122)
The UI coverage gate was tightened to 0.1pp against a fast-local
measurement (39.86% baseline); CI's slower runners measure ~0.9pp lower,
so tests-ui-e2e failed there. UI e2e coverage is diffusely
non-deterministic and tracks machine speed — a 0.1pp band can't hold
across environments.

Rather than loosen the gate, raise the floor under it: a render-smoke
spec mounts each lazy page (navigate + assert the header renders),
covering a dozen previously-untested pages and lifting coverage from
~39% to ~42.7% locally. Restore the tolerance to 0.8pp and set the
baseline conservatively (40.0), below the slow-CI floor, so the ratchet
holds without flapping.

Document the coverage policy — install the git hooks and don't bypass
them (no --no-verify, no hand-lowering the baseline or widening the
tolerance); raise coverage by adding tests instead; set the UI baseline
below the slow-CI floor — in AGENTS.md, CONTRIBUTING.md and
.agents/building-and-testing.md.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-06-01 14:24:36 +02:00
LocalAI [bot]
c01ed631d6 refactor(routing): extract replica picker into pkg/clusterrouting (#10123)
Move ReplicaCandidate and PickBestReplica out of core/services/nodes (which depends on gorm) into a new dependency-light leaf package pkg/clusterrouting, so the p2p federation server can later share the same replica-selection policy without pulling in a database driver.

core/services/nodes keeps a type alias and a thin delegator, so every existing reference (the LoadedReplicaStats interface method, the ReplicaCandidate row conversion in registry.go, and the SQL policy-mirror test) compiles and behaves unchanged. This is a pure, behavior-preserving refactor: the full nodes suite, including the policy-mirror spec that pins the SQL ORDER BY to PickBestReplica, stays green.

Assisted-by: Claude Code:claude-opus-4-8

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-01 09:38:55 +02:00
LocalAI [bot]
d47464cb06 docs: ⬆️ update docs version mudler/LocalAI (#10114)
⬆️ Update docs version mudler/LocalAI

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-01 08:16:29 +02:00
LocalAI [bot]
63f176346e chore: ⬆️ Update leejet/stable-diffusion.cpp to be65ac7511b30379b003626c15224798929e33d4 (#10118)
⬆️ Update leejet/stable-diffusion.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-01 00:43:50 +02:00
LocalAI [bot]
af94d08729 chore: ⬆️ Update ggml-org/whisper.cpp to fe69461618ffc50ba8afa65c25cc6c6e34d4537f (#10117)
⬆️ Update ggml-org/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-06-01 00:43:34 +02:00
LocalAI [bot]
6795d38f50 chore: ⬆️ Update mudler/parakeet.cpp to cb45f68068081af01e7092e91b038ee353eb56be (#10116)
⬆️ Update mudler/parakeet.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-31 23:57:15 +02:00
Richard Palethorpe
718223f33b feat(localvqe/audio): v1.3 release and add spectrograms to audio transform UI (#10113)
* chore(localvqe): update backend to v1.3, add v1.2/v1.3 gallery models

Bump the LocalVQE backend pin 72bfb4c6 -> b0f0378a, which adds the v1.2
(1.3 M) and v1.3 (4.8 M) GGUF SHA-256s to the upstream released-models
allowlist (and the arch_version=3 loader) so both load without
LOCALVQE_ALLOW_UNHASHED.

Add gallery entries for localvqe-v1.2-1.3m and localvqe-v1.3-4.8m
(SHA-256 verified against the downloaded weights) and update the
audio-transform docs to make v1.3 the current default while noting the
compact v1.1/v1.2 alternatives.

Assisted-by: Claude:claude-opus-4-8 Claude-Code
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* chore(flake): add ffmpeg-headless to the dev shell

pkg/utils/ffmpeg_test.go shells out to the `ffmpeg` CLI, and the
pre-commit gate runs those tests via `make test-coverage`. Without
ffmpeg in the dev shell the gate fails with "executable file not found
in $PATH". The headless build provides the CLI without GUI/X deps.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(localvqe): parse WAV by walking RIFF sub-chunks

Walk the RIFF chunk list instead of assuming the canonical 44-byte
header layout. Real inputs (browser-recorded clips, ffmpeg output with
an 18/40-byte extensible `fmt ` chunk or trailing LIST/INFO metadata)
would otherwise splice header/metadata bytes into the PCM stream as an
audible impulse. Honour the `data` chunk size and validate that both
`fmt ` and `data` chunks are present.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(security-headers): allow blob: in connect-src for waveform fetch

The waveform renderer XHRs/fetches a freshly-created blob: object URL
(e.g. an uploaded or enhanced clip before it has a server URL). XHR/fetch
of blob: is governed by connect-src, not media-src, so it was blocked by
the CSP. Add blob: to connect-src.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(react-ui): add input/output spectrogram view to AudioTransform

The transform page only showed time-domain amplitude waveforms, so you
could see how loud a clip was but not which frequencies the model
touched. Add a time x frequency spectrogram heatmap and render the input
and output spectrums side by side, so it's visible which bands the
enhancement attenuates (bright input bands that go dark in the output).

Computed client-side via a Hann-windowed STFT over both clips (a small
dependency-free radix-2 FFT), defaulting to the LocalVQE 512/256 frame
geometry. This shows the net input->output spectral change; the model's
internal gain mask is not exposed by the backend.

- src/utils/fft.js            radix-2 FFT
- src/hooks/useSpectrogram.js decode + STFT -> normalised dB magnitude grid
- src/components/audio/Spectrogram.jsx  canvas heatmap (magma colormap)
- AudioTransform.jsx          dual-spectrogram panel + CSS
- e2e spec + UI coverage baseline bump (38.29 -> 39.0; measured ~39.4-40.2)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(react-ui): make UI coverage deterministic, tighten the gate

UI e2e line coverage swung ~1pp run-to-run (39.1% <-> 40.2%), which forced
a loose 0.8pp tolerance on the monotonic gate — a band wide enough to let
a real ~300-line regression through silently. The swing was a bug, not
inherent jitter: the 'Create Agent navigates' spec ended on the URL
assertion, so AgentCreate.jsx's ~400 lines were collected only when its
render happened to beat the coverage teardown.

Wait for the page to actually render (assert its heading) so those lines
are covered every run. With the race gone, repeated runs land within
~0.013pp of each other, so:

- tighten UI_COVERAGE_TOLERANCE 0.8 -> 0.1 (noise floor, not a drift band)
- set the baseline to the real, reliably-achieved value (39.0 -> 39.86)

Localised by running the V8-coverage suite repeatedly and diffing per-file
line coverage; AgentCreate.jsx was the sole ~1pp flipper.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-31 23:56:46 +02:00
LocalAI [bot]
39e050d9e2 fix(parakeet-cpp): cublas/hipblas/vulkan builds were silently CPU-only (#10120)
fix(parakeet-cpp): forward PARAKEET_GGML_* so cublas/hipblas/vulkan builds aren't silently CPU-only

parakeet.cpp gates its GGML backends behind PARAKEET_GGML_CUDA/HIP/VULKAN and
does set(GGML_CUDA ${PARAKEET_GGML_CUDA} CACHE BOOL "" FORCE), which overwrites
a bare -DGGML_CUDA=ON back to OFF. So the backend's BUILD_TYPE=cublas (and hipblas,
vulkan) produced a CPU-only libparakeet.so. Forward the PARAKEET_GGML_* options
instead. Verified on a GB10 (CUDA 13): the lib now links libcudart/libcublas and
registers the CUDA backend, vs a CPU-only lib before.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-31 23:56:07 +02:00
LocalAI [bot]
c222161291 feat(distributed): resumable file uploads via HTTP Content-Range (#10109)
Large model GGUFs (multi-GB) transferred between master and worker over
flaky / bandwidth-throttled paths (e.g. libp2p relays with byte caps) used
to restart from byte 0 on every transport error. This change adds standard
HTTP Range/resume semantics to the worker's PUT /v1/files/<key> endpoint
and teaches the master-side HTTPFileStager to consult the worker for the
last accepted offset and resume from there.

Server side (file_transfer_server.go):
- PUT now honors Content-Range: bytes <start>-<end>/<total>. The handler
  validates that <start> matches the current on-disk size; mismatches
  return 416 with the actual size in X-File-Size.
- Mid-upload chunks return 308 Permanent Redirect ("Resume Incomplete")
  with the new size, so the client can keep going.
- An optional X-Content-SHA256 request header binds an upload to a target
  hash; cross-attempt drift returns 409. On the final chunk the server
  re-computes SHA-256 and returns 400 if it doesn't match.
- HEAD now advertises Accept-Ranges: bytes and Content-Length, and exposes
  X-Target-SHA256 for in-progress files (so clients can resume only when
  the partial bytes belong to the file they want to upload).
- Legacy PUTs with no Content-Range keep the original truncate-create
  semantics — zero behavior change on the happy path.

Client side (file_stager_http.go):
- Pre-PUT HEAD probe reads X-File-Size + X-Target-SHA256 to determine the
  resume offset.
- doUpload seeks to that offset and sends Content-Range + X-Content-SHA256.
- Retry loop switches from fixed 3 attempts / 5s-10s-20s backoff to an
  outer time budget
  with exponential backoff (1s -> 30s cap), so a 5GB upload over a flaky
  link can outlast many short disconnects.
- 308 and 416 responses are treated as transient: the next iteration
  re-HEADs to learn the correct offset.

Tests:
- Two-chunk Content-Range round-trip produces the correct file + sidecar.
- 416 on a Content-Range/file-size mismatch.
- 409 on X-Content-SHA256 drift between chunks.
- 400 on final-hash mismatch.
- HEAD on a partial upload exposes X-Target-SHA256 (not a misleading
  hash-of-partial-bytes via X-Content-SHA256).
- Pre-existing finished file with a different hash is transparently
  overwritten when a new PUT starts at byte 0.
- End-to-end resume: EnsureRemote against a worker that already holds a
  partial file transfers only the remainder.
- Mid-stream connection drop on attempt #1 is recovered by attempt #2
  resuming from the partial offset.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-31 11:02:20 +00:00
LocalAI [bot]
aa80d4681b chore: ⬆️ Update ggml-org/llama.cpp to d6588daa800058dfa54f1d7ea695b1a810c8ae18 (#10093)
* ⬆️ Update ggml-org/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix(llama-cpp): skip begin-of-stream null partial in PredictStream

Upstream llama.cpp (ggml-org/llama.cpp#23884), pulled in by this bump,
now emits an initial "begin" partial whose to_json() returns null. It
exists only to signal the HTTP layer to flush 200 status headers before
any token is produced.

gRPC has no such concept, and PredictStream had no guard: the null result
was fed straight into build_reply_from_json, which threw an uncaught
exception. That surfaced as a generic "Unexpected error in RPC handling"
and the task was cancelled the instant it launched, breaking the
PredictStream e2e spec.

Skip null results in both the first-result handling and the streaming
loop, mirroring upstream's own `if (first_result_json == nullptr)` guard.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-31 10:26:03 +00:00