LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-06 15:56:06 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	d05d83ff36	feat(realtime): stream tool-call turns via tokenizer-template autoparser Per review (richiejp): tool-call deltas exist, so streaming should work with tools too. It does — for models that use their tokenizer template. The C++ autoparser then clears reply.Message and delivers content + tool calls via ChatDeltas, so the streamed transcript carries only spoken content (no tool-call JSON leak) and the tool calls are parsed from the final response. - Drop the len(tools)==0 gate; stream when no tools OR use_tokenizer_template (grammar-based function calling still buffers, since its call is emitted as JSON in the token stream and would leak into the transcript). - streamLLMResponse takes tools/toolChoice/toolTurn, reads ChatDelta content in the token callback, parses tool calls from the final ChatDeltas, and creates the assistant content item lazily so a content-less tool turn emits only the tool calls. - Extract emitToolCallItems from the buffered path so both paths finalize tool calls, response.done, and server-side assistant-tool follow-ups identically. Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	076dcdbed8	refactor(realtime): buffer whole message for TTS, drop sentence segmenter Per review (richiejp): the sentence segmenter pipelined unary TTS by splitting on ASCII .!?/newline, which does nothing for languages without those boundaries (CJK/Thai) — there it already degraded to buffering the whole message anyway. Replace it with a uniform model: stream the LLM transcript live, buffer the full message, then synthesize it once. emitSpeech already streams the audio chunks when the backend implements TTSStream and falls back to a single unary delta otherwise, so this is real streaming TTS where supported and a clean whole-message synthesis elsewhere — no per-sentence emulation, no language assumptions. speechStreamer becomes transcriptStreamer (transcript deltas only); the whole-message synthesis moves into streamLLMResponse. Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	9ec1456ec6	fix(realtime): clean TTS temp path before read (gosec G304) emitSpeech reads the WAV file the TTS backend wrote. The read moved here from realtime.go, so code-scanning flagged it as a new G304 alert even though the path is backend-controlled (a temp file), not user input. Wrap it in filepath.Clean — a real path normalization that also clears the alert, keeping with the repo's no-#nosec convention. Assisted-by: Claude:claude-opus-4-8 gosec, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	cb3609530a	fix(realtime): always strip reasoning from spoken output disable_thinking maps to ReasoningConfig.DisableReasoning=true on the LLM config, which the backend reads as enable_thinking=false. But the realtime handler reads that SAME config to drive reasoning extraction, and there DisableReasoning=true means "skip stripping". PredictConfig() returns this LLM config, so both the streamed (speechStreamer) and buffered realtime paths stopped stripping <think>…</think> exactly when disable_thinking was on — leaking raw reasoning to the client whenever the model ignored the enable_thinking hint (e.g. lfm2.5). Add spokenReasoningConfig() which clears DisableReasoning for extraction (keeping custom tokens/tag pairs) and route both realtime paths through it. Spoken output now always strips reasoning, independent of the backend suppression hint. Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	f48344f2ff	fix(realtime): register pipeline streaming/thinking config fields TestAllFieldsHaveRegistryEntries (core/config/meta) requires every config field to have a meta registry entry. The four new pipeline fields (disable_thinking, streaming.{llm,tts,transcription}) had none, failing tests-linux/tests-apple. Add toggle entries for them. Also handle the os.Remove return in realtime_speech_test.go to satisfy errcheck (golangci-lint). Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	658a3efb20	docs(realtime): document pipeline streaming + disable_thinking Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	16a5bab71f	feat(realtime): wire streamLLMResponse for token-streamed replies triggerResponseAtTurn takes a streamed path when pipeline.streaming.llm is set, the turn has no tools, and audio is requested: streamLLMResponse announces the assistant item, drives the LLM token callback through a speechStreamer (reasoning-stripped transcript deltas + sentence-piped TTS), and emits the terminal events. Tool turns and non-streaming pipelines keep the existing buffered path unchanged, so this is strictly opt-in. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	ca23d05c66	feat(realtime): speechStreamer for token-streamed LLM->TTS emitSpeech now returns raw PCM (caller base64-encodes) so streamed segments accumulate correctly. speechStreamer consumes streamed LLM tokens: it strips reasoning via the streaming ReasoningExtractor, emits a transcript delta per content fragment, and sentence-pipes content into emitSpeech so each sentence is synthesized as soon as it's ready. Handler wiring (plain-content turns) follows. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	685e4632d7	feat(realtime): pipeline disable_thinking maps to enable_thinking off applyPipelineThinking forces the LLM's ReasoningConfig.DisableReasoning when pipeline.disable_thinking is set, which gRPCPredictOpts turns into the enable_thinking=false backend metadata. Applied at newModel construction on the per-session LLM config copy, so it doesn't leak to other model users and needs no realtime-specific request plumbing. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	98ed541b22	feat(realtime): streaming transcription text deltas Add emitTranscription and route commitUtterance through it. With pipeline.streaming.transcription set it streams each transcript fragment as a conversation.item.input_audio_transcription.delta via TranscribeStream then a completed event; otherwise it preserves the single completed-event unary behaviour. Returns the final transcript for response generation. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	378d6c25cf	feat(realtime): route response audio through emitSpeech (streaming TTS) Replace the inline unary TTS block in the response handler with emitSpeech, which streams a response.output_audio.delta per backend PCM chunk when pipeline.streaming.tts is set and otherwise preserves the single-delta unary behaviour. emitSpeech returns the accumulated base64 audio, stored on the conversation item as before. Transcript and audio-done events stay in the handler so later per-segment streaming can reuse emitSpeech. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	2c6fdd0570	feat(realtime): emitSpeech with flag-gated streaming TTS emitSpeech synthesizes a piece of text and forwards audio to the client, streaming one output_audio.delta per backend PCM chunk when the pipeline sets streaming.tts, or one delta for the whole utterance otherwise. WebRTC gets raw PCM (it resamples internally); WebSocket gets base64 PCM at the session rate. It emits no transcript/audio-done events so a streamed reply can be split into multiple spoken segments sharing one response. Adds fakeModel/fakeTransport test doubles for the realtime Model/Transport interfaces, driving streaming assertions deterministically. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	2ba2216ce2	feat(realtime): streaming TTS/transcription methods on Model interface Add TTSStream and TranscribeStream to the realtime Model interface and implement them on wrappedModel (delegating to backend.ModelTTSStream / ModelTranscriptionStream) and transcriptOnlyModel. ttsStream adapts the backend's WAV-framed stream (44-byte header carrying the sample rate, then PCM) into raw PCM + sample rate for the realtime transports. Handler wiring that consumes these (flag-gated) follows. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	e0820a11c9	feat(realtime): sentence segmenter for streamed LLM->TTS pipelining streamSegmenter accumulates streamed LLM tokens and emits complete sentence/clause segments (terminator+whitespace, or newline) so TTS can synthesize each segment as it completes instead of waiting for the whole reply. Pure helper; the streaming handler wiring consumes it next. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	16d7704a69	feat(realtime): pipeline streaming + disable_thinking config Add a nested pipeline.streaming.{llm,tts,transcription} block plus pipeline.disable_thinking, with StreamLLM/StreamTTS/StreamTranscription/ ThinkingDisabled helpers. Pointer-bools so unset keeps the unary path; existing configs are unaffected. Wiring into the realtime handler follows. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
LocalAI [bot]	e837921c2c	feat: forward reasoning_effort to the backend so jinja models honor it (#10184 ) * feat: forward reasoning_effort to the backend so jinja models honor it reasoning_effort was only mapped to the binary enable_thinking toggle and otherwise reached Go-side templates — it was never sent to the backend. So jinja-templated models whose chat template keys on reasoning_effort (gpt-oss Harmony, LFM2.5) could not be driven by it: LFM2.5 ignores enable_thinking and kept emitting <think>. Forward the effective reasoning_effort to the backend as a chat_template_kwarg (mirroring enable_thinking) in grpc-server.cpp, and put it in PredictOptions metadata (gRPCPredictOpts). Add a config-level default: ModelConfig.reasoning_effort and Pipeline.reasoning_effort, resolved by ModelConfig.ApplyReasoningEffort (request value overrides config default, none->disable / level->enable, an operator's reasoning.disable wins). request.go now uses that helper. Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): set the pipeline LLM's reasoning_effort Apply Pipeline.ReasoningEffort to the pipeline's LLM config when the realtime model is built (per-session copy, overrides the LLM's own reasoning_effort), and surface the resolved effort on the template input so Go-templated models get it too. jinja models receive it via the backend metadata. This lets a realtime pipeline disable thinking on models that only honor reasoning_effort (e.g. LFM2.5), which enable_thinking can't. Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 13:45:43 +00:00
Richard Palethorpe	73385713ca	feat(distributed): enforce registration token for worker file transfer (#10183 ) The worker HTTP file-transfer server is authenticated by the registration token via checkBearerToken, which fails open on an empty token: every /v1/files, /v1/files-list and /v1/backend-logs request is then served unauthenticated, granting read/write to the worker's models/staging/data directories. The fail-open was also silent (the only auth log sat on the unreachable reject branch), and the worker process never runs DistributedConfig.Validate(), so the existing frontend warning did not cover the component that exposes the server. Mirror the NatsRequireAuth pattern: keep anonymous as the default but make it loud and opt-in enforceable. - Log a prominent warning when the file-transfer server starts tokenless. - Add LOCALAI_REGISTRATION_REQUIRE_AUTH: DistributedConfig.Validate() errors on an empty token (frontend) and the worker refuses to start (fail-fast, before registration), so production can fail closed. Also satisfies the F-003 suggestion to fail Validate() on distributed + empty token. - Add LOCALAI_DISTRIBUTED_REQUIRE_AUTH umbrella switch implying both RegistrationRequireAuth and NatsRequireAuth — one production knob locking down the registration/file-transfer layer and the NATS bus together; the granular flags remain available as single-layer overrides. Wired into the frontend, supervisor worker, and agent worker (vLLM worker has neither a NATS connection nor a file-transfer server, so it is left untouched). - Document in distributed-mode.md (warning callout + flag tables). Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-06-05 14:34:28 +02:00
LocalAI [bot]	a4e671779a	chore: ⬆️ Update ggml-org/whisper.cpp to `99613cb720b65036237d44b52f753b51f75c2797` (#10178) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-05 09:04:25 +02:00
LocalAI [bot]	7051b2e0a1	chore: ⬆️ Update ggml-org/llama.cpp to `7c158fbb4aec1bdc9c81d6ca0e785139f4826fae` (#10179) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-05 09:04:10 +02:00
LocalAI [bot]	469737101a	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `1520eda980564241434b791ce2bbbd128c4be9ea` (#10180) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-05 09:03:08 +02:00
LocalAI [bot]	858257eaf0	fix(distributed): self-heal stale 'model not loaded' routing (#10181 ) * fix(distributed): self-heal stale 'model not loaded' routing In distributed mode the registry can list a model as loaded on a node while the worker has evicted it (autonomous LRU eviction, an out-of-band unload, etc.) yet the backend process survives. The router's cached-node check only verifies the process is alive (probeHealth), so it routes there and inference fails with "<backend>: model not loaded" — and stays broken until the controller restarts and rebuilds its registry. InFlightTrackingClient now reconciles this: when a tracked inference call returns a model-not-loaded error, it drops the stale replica row (RemoveNodeModel) so the next request reloads the model on a healthy node instead of routing back to the evicted one. The original error is returned unchanged; only the registry is corrected. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(distributed): typed model-not-loaded error via gRPC status code Replace the controller-side error-string match with a shared, code-aware helper. Go error types don't survive the gRPC boundary, so the signal is carried as a status code (FailedPrecondition): - pkg/grpc/grpcerrors: ModelNotLoaded(backend) constructor + IsModelNotLoaded(err) checker (status-code first, message fallback for backends not yet migrated). - InFlightTrackingClient.reconcile now uses grpcerrors.IsModelNotLoaded. - Migrate the Go backends that emit this error (parakeet-cpp, cloud-proxy, rfdetr-cpp) to the typed constructor. Acting on a false positive is harmless (the model is just reloaded). Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 09:01:36 +02:00
Adira	ef80a0e825	fix(config): add face/speaker recognition constants and register insightface + speaker-recognition (#10110 ) FLAG_FACE_RECOGNITION and FLAG_SPEAKER_RECOGNITION already existed as ModelConfigUsecase bitmask flags, and GuessUsecases already gate-checks both backends by name — but BackendCapabilities had no entries for either, so the UI could not classify them. Also missing were the Method* constants for the five proto-defined RPCs these backends implement (FaceVerify, FaceAnalyze, VoiceVerify, VoiceEmbed, VoiceAnalyze) and the corresponding Usecase* strings and UsecaseInfoMap entries needed to wire them into the rest of the capability system. Changes: - Add MethodFaceVerify, MethodFaceAnalyze, MethodVoiceVerify, MethodVoiceEmbed, MethodVoiceAnalyze GRPCMethod constants - Add UsecaseFaceRecognition ("face_recognition") and UsecaseSpeakerRecognition ("speaker_recognition") Usecase constants - Add UsecaseInfoMap entries for both new usecases, referencing the existing FLAG_FACE_RECOGNITION and FLAG_SPEAKER_RECOGNITION flags - Register insightface: Embedding + Detect + FaceVerify + FaceAnalyze - Register speaker-recognition: VoiceVerify + VoiceEmbed + VoiceAnalyze Follows up on #10107 which left these two out because they needed new constants first. Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>	2026-06-04 21:48:01 +02:00
LocalAI [bot]	92726f7631	fix(distributed): stage directory-based models to remote nodes (#10175 ) Distributed file-staging treated every model path field (ModelFile, etc.) as a single regular file: it os.Open'd the path and streamed its fd as the HTTP PUT body. For directory-based models — e.g. qwen3-tts-cpp, whose weights and tokenizer ggufs live under one directory referenced by parameters.model — opening the directory succeeds but reading its fd returns EISDIR, so routing the model to a remote NATS worker failed with "read /models/<model>: is a directory". Single-file models were unaffected, so only multi-file pipelines (e.g. the realtime TTS stage) broke. stageModelFiles now detects a directory path field and stages each contained file individually (via the new stageDirectory helper), preserving structure with the existing StagingKeyMapper and rewriting the field to the remote directory (deriving ModelPath as before). countStageableFiles makes the progress total count a directory's files so the staging tracker stays accurate. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-04 18:05:38 +02:00
LocalAI [bot]	994063ba9a	feat(qwen3-tts-cpp): normalize request language for flexible matching (#10174 ) The qwen3-tts.cpp backend honored the request `language` field only via exact lowercase two-letter codes in the C++ language_to_id table, silently defaulting to English for anything else (en-US, EN, english, ...). Add normalizeLanguage() in the Go handler: lowercase + trim, strip the region/locale suffix (en-US, pt_BR, zh-Hans -> en/pt/zh), and resolve common English full names (english -> en). The canonical codes match the existing C++ table, so no C++ change is needed. Covered by a pure-Go Ginkgo spec. Also document the language field and accepted forms under the Qwen3-TTS docs. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-04 17:26:31 +02:00
LocalAI [bot]	c1a55cf72d	chore: ⬆️ Update mudler/parakeet.cpp to `b11fe5bca78ad8b342dd559a43d76df3984bb447` (#10167) ⬆️ Update mudler/parakeet.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-04 12:07:09 +02:00
LocalAI [bot]	96758841d8	chore: ⬆️ Update predict-woo/qwen3-tts.cpp to `136e5d36c17083da0321fd96512dc7b263f94a44` (#10165) ⬆️ Update predict-woo/qwen3-tts.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-04 12:06:55 +02:00
LocalAI [bot]	7a59260621	chore: ⬆️ Update CrispStrobe/CrispASR to `13d54e110e1538e0f0bc3af0680b9ab246cfb48d` (#10145) ⬆️ Update CrispStrobe/CrispASR Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-04 12:06:32 +02:00
LocalAI [bot]	27e63b9a78	feat(tts): support per-request instructions and params (#10172 ) The OpenAI-compatible TTS endpoint accepts an `instructions` field, but it was silently dropped at the HTTP->gRPC boundary: neither schema.TTSRequest nor the gRPC TTSRequest proto carried it, so backends could only read such a value from static YAML options (identical for every request). This blocked per-line emotion/style and, for Qwen3-TTS VoiceDesign, limited a model config to a single designed voice. Plumb a generic per-request instruction string end to end, plus an optional backend-specific params map: - proto: add `optional string instructions` and `map<string,string> params` to TTSRequest. - schema: add Instructions (maps OpenAI `instructions`) and Params (LocalAI extension) to schema.TTSRequest. - core: thread both through ModelTTS/ModelTTSStream via a newTTSRequest helper that attaches instructions only when non-empty (so backends can fall back to YAML when unset); forward them from the /v1/audio/speech handler. - qwen-tts: prefer the per-request instruction over the YAML `instruct` option (used by both mode detection and generation) and merge per-request params. - chatterbox: merge per-request params (coerced to float/int/bool) over YAML options into generate() kwargs. Fully backward compatible: empty instructions fall back to the YAML option and backends that don't support style/voice instructions ignore the field. Closes #10164 Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-04 11:45:02 +02:00
LocalAI [bot]	55c0911c23	chore: ⬆️ Update leejet/stable-diffusion.cpp to `1f9ee88e09c258053fa59d5e05e23dfb10fa0b13` (#10166) ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-04 09:34:34 +02:00
LocalAI [bot]	f6cb6ab6d9	chore: ⬆️ Update ggml-org/llama.cpp to `94a220cd6745e6e3f8de62870b66fd5b9bc92700` (#10168) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-04 09:34:13 +02:00
LocalAI [bot]	9f11b09c6a	chore(model-gallery): ⬆️ update checksum (#10169) ⬆️ Checksum updates in gallery/index.yaml Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-04 00:32:15 +02:00
LocalAI [bot]	a5c4f822f0	chore: ⬆️ Update antirez/ds4 to `477c0e82e2699b35a65fd0a1ed6fe66b41087dfe` (#10142) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-03 19:45:23 +02:00
LocalAI [bot]	fb36c262fe	chore(model gallery): 🤖 add 1 new models via gallery agent (#10163) chore(model gallery): 🤖 add new models via gallery agent Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-03 19:44:51 +02:00
LocalAI [bot]	0e4e8980e6	chore: ⬆️ Update ggml-org/llama.cpp to `5c394fdc8b564eff6faacc50a139529d875f0e36` (#10143) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-03 19:44:21 +02:00
Richard Palethorpe	3a932a9803	feat(distributed): Add NATS JWT authentication and TLS/mTLS options (#10159 ) * feat(distributed): NATS JWT auth, TLS/mTLS options, and e2e coverage Mint per-node NATS user JWTs at registration when LOCALAI_NATS_ACCOUNT_SEED is set, and connect workers with scoped credentials from the register response. Add optional LOCALAI_NATS_TLS_CA/CERT/KEY for private CA and mTLS alongside tls:// URLs, plus test-e2e-distributed and NatsJWT container e2e specs. Document JWT setup (nats-auth-setup.sh) and TLS env vars in distributed-mode. Assisted-by: Grok:grok grok-build Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(distributed): correct NATS JWT scoping and harden client auth The JWT-auth path added in 46467cc7 had several gaps that fail silently under LOCALAI_NATS_REQUIRE_AUTH: - Agent-worker minted JWTs did not allow the subjects the agent worker actually subscribes to (jobs.mcp-ci.new and nodes.<id>.backend.stop), so MCP-CI jobs and backend-stop session cleanup were silently dropped. Scope the agent permission set to those subjects. - NATS subscription permission violations were swallowed (Subscribe returned a live-but-dead subscription). Confirm subscriptions with a server round-trip so a denial surfaces synchronously, and log async permission errors. - The backend worker connected anonymously when given a JWT without its paired seed; reject the unpaired credential instead. - The documented service-user permissions in nats-auth-setup.sh omitted prefixcache.>, which the frontend publishes and subscribes; add it. Also: add a credential-provider hook to the messaging client (consumed by the follow-up credential-lifecycle change), drop the always-nil error from NatsMessagingOptions, run go mod tidy (jwt/v2 and nkeys are now direct), and gofmt the feature's files. 
Tests: an agent-JWT e2e spec that connects to the enforcing NATS server and exercises every subscription the agent worker makes, plus permission allow-list coverage unit tests. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(distributed): acquire and auto-refresh worker NATS credentials Workers fetched NATS credentials once at startup, which broke two cases under JWT auth: a worker that registered while still pending admin approval never received a minted JWT (it connected unauthenticated and gave up), and a long-running worker's 24h JWT expired with no way to renew it. Introduce workerregistry.NATSCredentialManager, built on idempotent re-registration (the frontend preserves the node row and mints a fresh JWT each call): - Acquire re-registers through admin approval until the node is approved and credentials are minted (or returns the first success when auth is not required, preserving anonymous-NATS behavior). - RefreshLoop re-registers before the JWT expires (~75% of its lifetime), updating the credentials served to the connection. - Both are bounded (default 100 attempts / consecutive failures) and return an error on exhaustion, so an unapprovable or unrenewable worker exits non-zero and surfaces the problem instead of hanging or drifting toward an expired credential. The messaging client gains WithUserJWTProvider, fetching credentials on each (re)connect so the connection transparently adopts a refreshed JWT when the server expires the old one. RegisterFull exposes the approval status and full response; Register delegates to it. Both the backend worker and the agent worker are wired to this: explicit env credentials are used as-is, minted credentials are acquired-with-wait and refreshed, and a permanent refresh failure shuts the worker down so it restarts and re-acquires. 
Tests cover Acquire (wait-through-pending, bounded give-up, context cancel), RefreshLoop (refresh-before-expiry, bounded failure, no-expiry exit) and jwtExpiry decoding. Docs updated in distributed-mode.md. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-06-03 19:43:56 +02:00
LocalAI [bot]	9d10418593	fix(parakeet-cpp): convert audio before the non-batched transcribe path (#10161 ) The direct (non-batched) transcription path handed the original upload path straight to the C library via parakeet_capi_transcribe_path_json. That loader only understands 16 kHz mono WAV/PCM, so any other format (MP3, etc.) failed with "parakeet: failed to load audio: <file>". Only the batched path converted the input (via decodeWavMono16k -> utils.AudioToWav). Every other audio backend (whisper, crispasr) converts unconditionally with utils.AudioToWav before handing the file to its engine; the parakeet-cpp fallback was the lone exception. Extract a convertToWavMono16k helper (reused by decodeWavMono16k) that produces a 16 kHz mono WAV in a temp dir, and run the non-batched path through it before calling the C loader. WAV inputs already in the target format are passed through without ffmpeg. Add specs covering the helper (decodable copy + cleanup, and an error on a missing input) that need neither the model, the C library, nor ffmpeg. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-03 15:06:57 +02:00
dependabot[bot]	5470051d4d	chore(deps): bump grpcio from 1.80.0 to 1.81.0 in /backend/python/transformers (#10158) chore(deps): bump grpcio in /backend/python/transformers Bumps [grpcio](https://github.com/grpc/grpc) from 1.80.0 to 1.81.0. - [Release notes](https://github.com/grpc/grpc/releases) - [Commits](https://github.com/grpc/grpc/compare/v1.80.0...v1.81.0) --- updated-dependencies: - dependency-name: grpcio dependency-version: 1.81.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-06-03 10:38:43 +02:00
LocalAI [bot]	68c5eeebc3	chore: ⬆️ Update ggml-org/whisper.cpp to `610e664ba7cfe3af46125ed1b5a1184fccb51bcd` (#10140) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-03 10:38:28 +02:00
dependabot[bot]	1531fabe23	chore(deps): bump securego/gosec from 2.22.9 to 2.27.1 (#10147) Bumps [securego/gosec](https://github.com/securego/gosec) from 2.22.9 to 2.27.1. - [Release notes](https://github.com/securego/gosec/releases) - [Commits](https://github.com/securego/gosec/compare/v2.22.9...v2.27.1) --- updated-dependencies: - dependency-name: securego/gosec dependency-version: 2.27.1 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-06-03 10:38:07 +02:00
LocalAI [bot]	b7673d5b76	chore: ⬆️ Update leejet/stable-diffusion.cpp to `2d40a8b2adcdf8b5b0ca0535f3bb7801b6ba13e5` (#10144) ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-03 10:37:51 +02:00
dependabot[bot]	b64bdaf406	chore(deps): bump github.com/google/go-containerregistry from 0.21.5 to 0.21.6 (#10149) chore(deps): bump github.com/google/go-containerregistry Bumps [github.com/google/go-containerregistry](https://github.com/google/go-containerregistry) from 0.21.5 to 0.21.6. - [Release notes](https://github.com/google/go-containerregistry/releases) - [Commits](https://github.com/google/go-containerregistry/compare/v0.21.5...v0.21.6) --- updated-dependencies: - dependency-name: github.com/google/go-containerregistry dependency-version: 0.21.6 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-06-03 10:37:33 +02:00
dependabot[bot]	eebf08ff1d	chore(deps): bump grpcio from 1.80.0 to 1.81.0 in /backend/python/vllm (#10157) Bumps [grpcio](https://github.com/grpc/grpc) from 1.80.0 to 1.81.0. - [Release notes](https://github.com/grpc/grpc/releases) - [Commits](https://github.com/grpc/grpc/compare/v1.80.0...v1.81.0) --- updated-dependencies: - dependency-name: grpcio dependency-version: 1.81.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-06-03 10:37:16 +02:00
dependabot[bot]	42e51894c3	chore(deps): bump go.opentelemetry.io/otel/exporters/prometheus from 0.65.0 to 0.66.0 (#10151) chore(deps): bump go.opentelemetry.io/otel/exporters/prometheus Bumps [go.opentelemetry.io/otel/exporters/prometheus](https://github.com/open-telemetry/opentelemetry-go) from 0.65.0 to 0.66.0. - [Release notes](https://github.com/open-telemetry/opentelemetry-go/releases) - [Changelog](https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md) - [Commits](https://github.com/open-telemetry/opentelemetry-go/compare/exporters/prometheus/v0.65.0...metric/x/v0.66.0) --- updated-dependencies: - dependency-name: go.opentelemetry.io/otel/exporters/prometheus dependency-version: 0.66.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-06-03 09:14:42 +02:00
LocalAI [bot]	d9ae6481fb	chore: ⬆️ Update mudler/parakeet.cpp to `9edf17c3ada66e0f881dcff155492867db7ac4cf` (#10141 ) ⬆️ Update mudler/parakeet.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-03 08:49:47 +02:00
dependabot[bot]	f1c495a748	chore(deps): bump github.com/mudler/edgevpn from 0.32.2 to 0.34.0 (#10153 ) Bumps [github.com/mudler/edgevpn](https://github.com/mudler/edgevpn) from 0.32.2 to 0.34.0. - [Release notes](https://github.com/mudler/edgevpn/releases) - [Commits](https://github.com/mudler/edgevpn/compare/v0.32.2...v0.34.0) --- updated-dependencies: - dependency-name: github.com/mudler/edgevpn dependency-version: 0.34.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-06-03 08:34:16 +02:00
LocalAI [bot]	415b561947	docs: fix distributed-mode diagram (workers use NATS, not PostgreSQL) (#10138 ) docs: fix distributed-mode diagram - workers coordinate via NATS, not PostgreSQL The architecture diagram drew the worker-bound arrows from the PostgreSQL area of the control plane, implying workers connect to PostgreSQL. They do not: PostgreSQL is the frontends' shared state, while workers coordinate over NATS (backend.install events) and receive LoadModel over gRPC from a frontend. Re-route the worker arrows to originate from the NATS chip. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-02 22:05:33 +02:00
Ettore Di Giacinto	e6a0d4c375	Remove diagram from distributed mode documentation Removed ASCII diagram of distributed mode architecture from the documentation. Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2026-06-02 18:48:12 +02:00
LocalAI [bot]	7e59a5c7c5	docs: architecture & feature diagrams (blueprint style) (#10137 ) * docs: add 'how LocalAI works' architecture diagram Add a blueprint-style architecture diagram: clients -> small core (API, router, WebUI, agents) -> gRPC -> backend processes pulled on demand as OCI images. Place it on the overview page and replace the stale external architecture image on the reference page. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs: add blueprint diagrams across feature, distributed & getting-started docs Add 24 architecture/flow/comparison diagrams (PNG + HTML source) under docs/static/images/diagrams/, wired into their docs pages, from an impact-vs-effort audit of the docs. Broaden the API surface on the overview architecture diagram (OpenAI, Anthropic, ElevenLabs, Ollama, and LocalAI's own API) and move the gRPC boundary label clear of the arrows. Pages: distributed mode (architecture, scheduling, ds4 layer-split), distributed inferencing, MLX, realtime, quantization, MCP, agents, mitm & cloud proxy, middleware, reverse-proxy TLS, VRAM, voice & face recognition, reranker, function calling, fine-tuning (recipe + jobs), diarization, audio transform, quickstart, model resolution. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs: add composable-core diagram to README hero Commit the composable-core card (small core + on-demand backend tiles) alongside the other diagrams and reference it from the README hero via a repo-relative path, so it renders on GitHub. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs: fix composable-core connectors/badge and federated-vs-worker layout - composable-core: thicken the plug-in connectors so they read clearly, and widen the SEPARATE IMAGE badge so its text no longer overflows the box. 
- federated-vs-worker: shorten the WHOLE/SPLIT REQUEST pills to fit, and replace the tangled node-to-node activation arrows with a clean fan-out (request split across all sharded nodes), mirroring the federated panel. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-02 18:43:22 +02:00
LocalAI [bot]	aea954a482	docs: position LocalAI as a composable engine, not a bundle (#10136 ) Reframe the README hero and docs (homepage, overview, FAQ) around the composable architecture: a small core, with backends built as dedicated gRPC services around best-in-class engines, shipped as separate OCI images and pulled on demand. Lead from strength: drop the "36+ backends" kitchen-sink framing and the "All-in-One Complete AI Stack" / "single binary that gives you everything" lines that read as a monolith. - README: small-core differentiator; composable + open/extensible bullets - _index.md: composable tagline; install only what you use - overview.md: core vs on-demand backends; gRPC/OCI mechanics as benefits; bring-your-own model and backend - faq.md: "Do I need to install all the backends?" and "Can I bring my own model or backend?" Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-02 17:34:43 +02:00
Ettore Di Giacinto	595e448714	docs(llama.cpp): note tensor split now works with quantized KV cache (#10135 ) The split_mode: tensor description claimed tensor parallelism requires KV-cache quantization to be disabled. ggml-org/llama.cpp#23792 lifts that restriction by extending the meta backend to preserve shape information through KV-cache flatten/reshape, so cache_type_k/cache_type_v quantization can be combined with -sm tensor on builds that include it. Documentation only: no backend code, grpc-server.cpp comment, or llama.cpp pin changes. Assisted-by: Claude Code:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-02 15:52:23 +02:00

6601 Commits