LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-07 00:06:51 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	076dcdbed8	refactor(realtime): buffer whole message for TTS, drop sentence segmenter Per review (richiejp): the sentence segmenter pipelined unary TTS by splitting on ASCII .!?/newline, which does nothing for languages without those boundaries (CJK/Thai) — there it already degraded to buffering the whole message anyway. Replace it with a uniform model: stream the LLM transcript live, buffer the full message, then synthesize it once. emitSpeech already streams the audio chunks when the backend implements TTSStream and falls back to a single unary delta otherwise, so this is real streaming TTS where supported and a clean whole-message synthesis elsewhere — no per-sentence emulation, no language assumptions. speechStreamer becomes transcriptStreamer (transcript deltas only); the whole-message synthesis moves into streamLLMResponse. Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	9ec1456ec6	fix(realtime): clean TTS temp path before read (gosec G304) emitSpeech reads the WAV file the TTS backend wrote. The read moved here from realtime.go, so code-scanning flagged it as a new G304 alert even though the path is backend-controlled (a temp file), not user input. Wrap it in filepath.Clean — a real path normalization that also clears the alert, keeping with the repo's no-#nosec convention. Assisted-by: Claude:claude-opus-4-8 gosec, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	cb3609530a	fix(realtime): always strip reasoning from spoken output disable_thinking maps to ReasoningConfig.DisableReasoning=true on the LLM config, which the backend reads as enable_thinking=false. But the realtime handler reads that SAME config to drive reasoning extraction, and there DisableReasoning=true means "skip stripping". PredictConfig() returns this LLM config, so both the streamed (speechStreamer) and buffered realtime paths stopped stripping <think>…</think> exactly when disable_thinking was on — leaking raw reasoning to the client whenever the model ignored the enable_thinking hint (e.g. lfm2.5). Add spokenReasoningConfig() which clears DisableReasoning for extraction (keeping custom tokens/tag pairs) and route both realtime paths through it. Spoken output now always strips reasoning, independent of the backend suppression hint. Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	f48344f2ff	fix(realtime): register pipeline streaming/thinking config fields TestAllFieldsHaveRegistryEntries (core/config/meta) requires every config field to have a meta registry entry. The four new pipeline fields (disable_thinking, streaming.{llm,tts,transcription}) had none, failing tests-linux/tests-apple. Add toggle entries for them. Also handle the os.Remove return in realtime_speech_test.go to satisfy errcheck (golangci-lint). Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	16a5bab71f	feat(realtime): wire streamLLMResponse for token-streamed replies triggerResponseAtTurn takes a streamed path when pipeline.streaming.llm is set, the turn has no tools, and audio is requested: streamLLMResponse announces the assistant item, drives the LLM token callback through a speechStreamer (reasoning-stripped transcript deltas + sentence-piped TTS), and emits the terminal events. Tool turns and non-streaming pipelines keep the existing buffered path unchanged, so this is strictly opt-in. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	ca23d05c66	feat(realtime): speechStreamer for token-streamed LLM->TTS emitSpeech now returns raw PCM (caller base64-encodes) so streamed segments accumulate correctly. speechStreamer consumes streamed LLM tokens: it strips reasoning via the streaming ReasoningExtractor, emits a transcript delta per content fragment, and sentence-pipes content into emitSpeech so each sentence is synthesized as soon as it's ready. Handler wiring (plain-content turns) follows. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	685e4632d7	feat(realtime): pipeline disable_thinking maps to enable_thinking off applyPipelineThinking forces the LLM's ReasoningConfig.DisableReasoning when pipeline.disable_thinking is set, which gRPCPredictOpts turns into the enable_thinking=false backend metadata. Applied at newModel construction on the per-session LLM config copy, so it doesn't leak to other model users and needs no realtime-specific request plumbing. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	98ed541b22	feat(realtime): streaming transcription text deltas Add emitTranscription and route commitUtterance through it. With pipeline.streaming.transcription set it streams each transcript fragment as a conversation.item.input_audio_transcription.delta via TranscribeStream then a completed event; otherwise it preserves the single completed-event unary behaviour. Returns the final transcript for response generation. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	378d6c25cf	feat(realtime): route response audio through emitSpeech (streaming TTS) Replace the inline unary TTS block in the response handler with emitSpeech, which streams a response.output_audio.delta per backend PCM chunk when pipeline.streaming.tts is set and otherwise preserves the single-delta unary behaviour. emitSpeech returns the accumulated base64 audio, stored on the conversation item as before. Transcript and audio-done events stay in the handler so later per-segment streaming can reuse emitSpeech. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	2c6fdd0570	feat(realtime): emitSpeech with flag-gated streaming TTS emitSpeech synthesizes a piece of text and forwards audio to the client, streaming one output_audio.delta per backend PCM chunk when the pipeline sets streaming.tts, or one delta for the whole utterance otherwise. WebRTC gets raw PCM (it resamples internally); WebSocket gets base64 PCM at the session rate. It emits no transcript/audio-done events so a streamed reply can be split into multiple spoken segments sharing one response. Adds fakeModel/fakeTransport test doubles for the realtime Model/Transport interfaces, driving streaming assertions deterministically. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	2ba2216ce2	feat(realtime): streaming TTS/transcription methods on Model interface Add TTSStream and TranscribeStream to the realtime Model interface and implement them on wrappedModel (delegating to backend.ModelTTSStream / ModelTranscriptionStream) and transcriptOnlyModel. ttsStream adapts the backend's WAV-framed stream (44-byte header carrying the sample rate, then PCM) into raw PCM + sample rate for the realtime transports. Handler wiring that consumes these (flag-gated) follows. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
Ettore Di Giacinto	e0820a11c9	feat(realtime): sentence segmenter for streamed LLM->TTS pipelining streamSegmenter accumulates streamed LLM tokens and emits complete sentence/clause segments (terminator+whitespace, or newline) so TTS can synthesize each segment as it completes instead of waiting for the whole reply. Pure helper; the streaming handler wiring consumes it next. Assisted-by: Claude:claude-opus-4-8 go vet Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 14:03:36 +00:00
LocalAI [bot]	e837921c2c	feat: forward reasoning_effort to the backend so jinja models honor it (#10184 ) * feat: forward reasoning_effort to the backend so jinja models honor it reasoning_effort was only mapped to the binary enable_thinking toggle and otherwise reached Go-side templates — it was never sent to the backend. So jinja-templated models whose chat template keys on reasoning_effort (gpt-oss Harmony, LFM2.5) could not be driven by it: LFM2.5 ignores enable_thinking and kept emitting <think>. Forward the effective reasoning_effort to the backend as a chat_template_kwarg (mirroring enable_thinking) in grpc-server.cpp, and put it in PredictOptions metadata (gRPCPredictOpts). Add a config-level default: ModelConfig.reasoning_effort and Pipeline.reasoning_effort, resolved by ModelConfig.ApplyReasoningEffort (request value overrides config default, none->disable / level->enable, an operator's reasoning.disable wins). request.go now uses that helper. Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(realtime): set the pipeline LLM's reasoning_effort Apply Pipeline.ReasoningEffort to the pipeline's LLM config when the realtime model is built (per-session copy, overrides the LLM's own reasoning_effort), and surface the resolved effort on the template input so Go-templated models get it too. jinja models receive it via the backend metadata. This lets a realtime pipeline disable thinking on models that only honor reasoning_effort (e.g. LFM2.5), which enable_thinking can't. Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-05 13:45:43 +00:00
LocalAI [bot]	27e63b9a78	feat(tts): support per-request instructions and params (#10172 ) The OpenAI-compatible TTS endpoint accepts an `instructions` field, but it was silently dropped at the HTTP->gRPC boundary: neither schema.TTSRequest nor the gRPC TTSRequest proto carried it, so backends could only read such a value from static YAML options (identical for every request). This blocked per-line emotion/style and, for Qwen3-TTS VoiceDesign, limited a model config to a single designed voice. Plumb a generic per-request instruction string end to end, plus an optional backend-specific params map: - proto: add `optional string instructions` and `map<string,string> params` to TTSRequest. - schema: add Instructions (maps OpenAI `instructions`) and Params (LocalAI extension) to schema.TTSRequest. - core: thread both through ModelTTS/ModelTTSStream via a newTTSRequest helper that attaches instructions only when non-empty (so backends can fall back to YAML when unset); forward them from the /v1/audio/speech handler. - qwen-tts: prefer the per-request instruction over the YAML `instruct` option (used by both mode detection and generation) and merge per-request params. - chatterbox: merge per-request params (coerced to float/int/bool) over YAML options into generate() kwargs. Fully backward compatible: empty instructions fall back to the YAML option and backends that don't support style/voice instructions ignore the field. Closes #10164 Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-04 11:45:02 +02:00
Richard Palethorpe	3a932a9803	feat(distributed): Add NATS JWT authentication and TLS/mTLS options (#10159 ) * feat(distributed): NATS JWT auth, TLS/mTLS options, and e2e coverage Mint per-node NATS user JWTs at registration when LOCALAI_NATS_ACCOUNT_SEED is set, and connect workers with scoped credentials from the register response. Add optional LOCALAI_NATS_TLS_CA/CERT/KEY for private CA and mTLS alongside tls:// URLs, plus test-e2e-distributed and NatsJWT container e2e specs. Document JWT setup (nats-auth-setup.sh) and TLS env vars in distributed-mode. Assisted-by: Grok:grok grok-build Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(distributed): correct NATS JWT scoping and harden client auth The JWT-auth path added in 46467cc7 had several gaps that fail silently under LOCALAI_NATS_REQUIRE_AUTH: - Agent-worker minted JWTs did not allow the subjects the agent worker actually subscribes to (jobs.mcp-ci.new and nodes.<id>.backend.stop), so MCP-CI jobs and backend-stop session cleanup were silently dropped. Scope the agent permission set to those subjects. - NATS subscription permission violations were swallowed (Subscribe returned a live-but-dead subscription). Confirm subscriptions with a server round-trip so a denial surfaces synchronously, and log async permission errors. - The backend worker connected anonymously when given a JWT without its paired seed; reject the unpaired credential instead. - The documented service-user permissions in nats-auth-setup.sh omitted prefixcache.>, which the frontend publishes and subscribes; add it. Also: add a credential-provider hook to the messaging client (consumed by the follow-up credential-lifecycle change), drop the always-nil error from NatsMessagingOptions, run go mod tidy (jwt/v2 and nkeys are now direct), and gofmt the feature's files. Tests: an agent-JWT e2e spec that connects to the enforcing NATS server and exercises every subscription the agent worker makes, plus permission allow-list coverage unit tests. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(distributed): acquire and auto-refresh worker NATS credentials Workers fetched NATS credentials once at startup, which broke two cases under JWT auth: a worker that registered while still pending admin approval never received a minted JWT (it connected unauthenticated and gave up), and a long-running worker's 24h JWT expired with no way to renew it. Introduce workerregistry.NATSCredentialManager, built on idempotent re-registration (the frontend preserves the node row and mints a fresh JWT each call): - Acquire re-registers through admin approval until the node is approved and credentials are minted (or returns the first success when auth is not required, preserving anonymous-NATS behavior). - RefreshLoop re-registers before the JWT expires (~75% of its lifetime), updating the credentials served to the connection. - Both are bounded (default 100 attempts / consecutive failures) and return an error on exhaustion, so an unapprovable or unrenewable worker exits non-zero and surfaces the problem instead of hanging or drifting toward an expired credential. The messaging client gains WithUserJWTProvider, fetching credentials on each (re)connect so the connection transparently adopts a refreshed JWT when the server expires the old one. RegisterFull exposes the approval status and full response; Register delegates to it. Both the backend worker and the agent worker are wired to this: explicit env credentials are used as-is, minted credentials are acquired-with-wait and refreshed, and a permanent refresh failure shuts the worker down so it restarts and re-acquires. Tests cover Acquire (wait-through-pending, bounded give-up, context cancel), RefreshLoop (refresh-before-expiry, bounded failure, no-expiry exit) and jwtExpiry decoding. Docs updated in distributed-mode.md. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-06-03 19:43:56 +02:00
LocalAI [bot]	76fe0bb929	feat(crispasr): add CrispASR backend — multi-architecture ASR + TTS (#10099 ) * feat(crispasr): backend source files (Go gRPC server, C-ABI shim, build files) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * polish(crispasr): brand error strings + fix stale shim comment Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * build(crispasr): register backend in root Makefile Mirror the whisper Go backend registration for the new crispasr backend: NOTPARALLEL entry, prepare-test-extra/test-extra hooks, BACKEND_CRISPASR definition, docker-build target generation, and the docker-build-backends aggregate target. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(crispasr): add backend build matrix entries Mirror the 11 whisper golang Dockerfile matrix entries (CPU amd64/arm64, CUDA 12/13, L4T CUDA 13, Intel SYCL f32/f16, Vulkan amd64/arm64, L4T arm64, ROCm hipblas) with backend and tag-suffix substituted to crispasr. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(gallery): add crispasr backend gallery entries Add the crispasr meta anchor and its full set of image gallery entries (cpu, metal, cuda12/13, rocm, intel-sycl f32/f16, vulkan, L4T arm64, L4T cuda13 arm64, plus -development variants), mirroring the whisper backend gallery block. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(crispasr): bump CRISPASR_VERSION via bump_deps workflow Track CrispStrobe/CrispASR main branch and bump CRISPASR_VERSION in backend/go/crispasr/Makefile. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * build(crispasr): don't wire fixture-gated test into test-extra Mirror the whisper Go backend: its AudioTranscription test is gated on model/audio fixtures and skips in CI, so building crispasr (the heaviest ggml compile in the tree) inside the unit-test lane adds a long compile for zero coverage. The backend image build in backend-matrix.yml remains the authoritative compile check. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(crispasr): add darwin metal build entry (mirror whisper) The metal-crispasr gallery entries and capabilities.metal mapping reference -metal-darwin-arm64-crispasr, which is only produced by an includeDarwin entry. Mirror whisper's darwin metal entry so the tag actually gets built. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(crispasr): place hipblas matrix entry next to whisper twin Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(crispasr): register crispasr as pref-only ASR backend + test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(crispasr): port whisper behavioral suite (cancellation + streaming) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(crispasr): fix skip message env var names to CRISPASR_* Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(crispasr): switch shim to crispasr_session_* multi-architecture API The shim used whisper_full(), which in CrispASR is the whisper-only path: libcrispasr only transcribes Whisper GGUFs through it. Multi-architecture transcription (Parakeet, Voxtral, Qwen3-ASR, Canary, Granite, FunASR, Paraformer, SenseVoice, ...) goes through the crispasr_session_* C-ABI, which auto-detects the architecture from the GGUF and dispatches to the matching backend. Rewrite the C shim around crispasr_session_open / _transcribe_lang / _result_* and add get_backend() so the selected backend is logged. load_model now takes a threads param (session_open binds n_threads at open). The session result is segment+word based with no token IDs and no per-decode callback, so drop n_tokens / get_token_id / get_segment_speaker_turn_next / set_new_segment_callback. set_abort is kept for API parity but is best-effort: the session transcribe is blocking with no abort hook. Update the purego bindings and gocrispasr.go to match: tokens are left empty, speaker-turn handling is removed, and AudioTranscriptionStream emits one delta per non-empty segment after the blocking decode returns (no progressive streaming via the session API), preserving the concat(deltas) == final.Text invariant. crispasr_session_set_translate is exported by libcrispasr but not declared in crispasr.h, so it is forward-declared in the shim alongside the open/transcribe/result functions. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * build(crispasr): link full CrispASR backend set for multi-arch support The shim's crispasr_session_* dispatch calls into the per-architecture backend libs (parakeet, voxtral, qwen3_asr, canary, funasr, paraformer, sensevoice, ...), which CrispASR builds as static archives. Linking only crispasr + ggml dead-stripped every backend object from the final module (nm backend-symbol count: 0), leaving a whisper-only .so. Link the same backend set as crispasr-cli so the static archives are pulled in. After this the module carries the backend symbols (nm count 407, .so grows from ~2.1MB to ~6.7MB) and the session API can dispatch to every compiled-in architecture. Also rewrite ${CMAKE_SOURCE_DIR}/examples/talk-llama to ${PROJECT_SOURCE_DIR}/... in the vendored src/CMakeLists.txt: CrispASR locates its vendored llama.cpp via ${CMAKE_SOURCE_DIR}, which is wrong when CrispASR is add_subdirectory'd (CMAKE_SOURCE_DIR points at this backend dir, not the CrispASR root). PROJECT_SOURCE_DIR is correct both standalone and as a subproject; the sed is idempotent. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(crispasr): adapt suite to session API (blocking, no decode callback) Register the new symbol set (drop the removed token/speaker/callback funcs, add get_backend; load_model now takes 2 args). The session transcribe is blocking with no abort hook, so a mid-decode cancel can't interrupt it: change the cancellation spec to cancel the context before the call and assert codes.Canceled from the pre-call ctx.Err() check, dropping the <5s mid-decode timing assertion. The streaming spec still holds with per-segment post-decode emission (>=2 deltas, concat(deltas) == final.Text). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(gallery): add CrispASR ASR model entries (-crispasr) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(gallery): keep only session-auto-detectable CrispASR ASR models The crispasr backend loads models via crispasr_session_open, which auto-detects the backend from the GGUF general.architecture using crispasr_detect_backend_from_gguf. Architectures not in that detect map cannot be opened, so those gallery entries fail to load. Removed entries whose architecture is not wired into CrispASR v0.6.11's session auto-detect router (they can be re-added when upstream maps them): - Not in the detect map: data2vec, firered-asr, funasr, fun-asr-mlt-nano, glm-asr, hubert, kyutai-stt, mega-asr, mimo-asr, moonshine{,-de,-streaming,-tiny-de}, omniasr{,-llm,-llm-1b}, paraformer, sensevoice. - Pending verification (filename-heuristic routed, not arch-detected): parakeet-ctc-0.6b, parakeet-ctc-1.1b. Their GGUFs are routed to the fastconformer-ctc backend by a filename heuristic in the model registry, which implies general.architecture is not a mapped string. Kept the parakeet rnnt/tdt_ctc variants: convert-parakeet-to-gguf.py writes general.architecture="parakeet" unconditionally and encodes the rnnt/ctc distinction in metadata fields, so they session-auto-detect. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(crispasr): TTS synthesis via crispasr_session_synthesize (24kHz) Add tts_synthesize/tts_free/tts_set_voice to the C-ABI shim. They reuse the already-open g_session (crispasr_session_open auto-detects a TTS model) and dispatch to the upstream synthesis call, which returns malloc'd 24 kHz mono float PCM. Orpheus needs a SNAC codec path that we do not set, so it returns NULL here and surfaces as an error Go-side. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(crispasr): implement TTS/TTSStream gRPC methods Bind the new shim functions via purego and implement TTS, TTSStream and a writeWAV24k helper. synthesize copies the C-owned PCM out before freeing it; TTS writes a 24 kHz mono 16-bit WAV to req.Dst via go-audio/wav. CrispASR has no progressive synth, so TTSStream synthesizes fully, encodes to WAV, and emits the bytes as a single chunk; it owns the results-channel close (the gRPC server wrapper ranges until close), mirroring vibevoice-cpp's TTSStream. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(crispasr): log when a TTS voice override is not honored Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(gallery): add CrispASR vibevoice-tts model entry Only vibevoice-tts works through the current shim: qwen3-tts, chatterbox, and orpheus require companion codec/s3gen/SNAC paths (set_codec_path / set_s3gen_path) that the shim doesn't wire yet, and kokoro/indextts/voxcpm2 aren't in the session auto-detect map. Those are follow-ups. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(crispasr): gated TTS synthesis spec Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(crispasr): satisfy golangci-lint (errcheck defers + unsafeptr nolint) The crispasr Go file is entirely new, so new-from-merge-base lints every line (unlike the grandfathered whisper backend it was forked from): - handle os.RemoveAll / fh.Close return values in AudioTranscription - annotate the two intentional C-pointer unsafe.Slice sites with //nolint:govet Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(crispasr): backend: and codec: model options (explicit arch + companion files) Add two model-config options to the CrispASR backend via opts.Options: - backend:<name> selects an explicit CrispASR backend (bypassing auto-detect) by routing load_model through crispasr_session_open_explicit, unlocking architectures the detector won't pick on its own (qwen3, cohere, granite, voxtral, moonshine, mimo-asr, orpheus, kokoro, chatterbox, etc.). - codec:<path> loads a companion file (qwen3-tts codec, orpheus SNAC, chatterbox s3gen, or mimo-asr tokenizer) via the universal crispasr_session_set_codec_path setter after the session opens. A relative path resolves against the model directory. rc==0 means success or not-applicable; only a negative rc is fatal. The C shim load_model gains a backend_name argument and a new set_codec_path entry point; the Go bridge parses the prefix:value options and registers the new symbol. The vad_only path is unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(gallery): expand CrispASR models via backend:/codec: options (explicit arch + companions) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(gallery): use virtual.yaml base for crispasr models The crispasr entries are just backend + model + a couple options, fully expressed inline via overrides:/files: in gallery/index.yaml. Point each url: at the shared gallery/virtual.yaml (the established 'virtual' model trick) and drop the 36 redundant per-model gallery/-crispasr.yaml files. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> fix(gallery): drop voice-requiring TTS entries (keep vibevoice-tts) Real e2e showed qwen3-tts/orpheus/chatterbox don't synthesize through the current shim: the codec: companion loads fine, but these engines additionally need a voice pack / voice prompt / reference clip (qwen3-tts base errors 'no voice'; chatterbox is zero-shot cloning; orpheus uses named voices) that the backend doesn't wire. (qwen3-tts also can't auto-detect: its GGUF arch is 'qwen3tts', unmapped by the detector — would need backend:qwen3-tts.) Removed to avoid shipping non-working gallery entries; vibevoice-tts (built-in voice, e2e-verified) remains the working TTS. Voice-pack wiring is a follow-up. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(crispasr): speaker: and voice: TTS options (baked speakers + voice packs/prompts) speaker:<name> -> crispasr_session_set_speaker_name (baked speakers: qwen3-tts CustomVoice, orpheus). voice:<path>(+voice_text:<ref>) -> crispasr_session_set_voice (voice-pack GGUF, or WAV zero-shot clone with ref text). Applied at Load as the default voice; req.Voice still overrides the speaker per request. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(gallery): re-add e2e-verified TTS engines (chatterbox, qwen3-tts-customvoice, orpheus) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-31 12:11:03 +02:00
LocalAI [bot]	a44bdb29d4	feat: prefix-cache-aware routing for distributed mode (#10071 ) * feat(radixtree): generic prefix tree skeleton with longest-match Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(radixtree): Insert with path recency refresh and entry cap Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(radixtree): TTL idle-expiry and Evict sweep with branch pruning Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(radixtree): recency-weighted per-value Weight Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(radixtree): Remove all entries for a value Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(radixtree): race-free concurrency smoke test Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(radixtree): reclaim empty branches, RWMutex reads, TTL boundary, empty-key guard Address review findings on the generic prefix tree: - Extract a shared pruneWalk helper parameterized by a shouldClear predicate and use it from Evict, Remove, and the MaxEntries path. Previously evictOldestLocked cleared a victim's value but never removed the now value-less node or its childless ancestors, so internal nodes accumulated under sustained churn at the cap. The MaxEntries path now prunes the victim and its empty ancestors. - DRY: pruneWalk replaces the duplicated logic in the former pruneLocked and Remove's inner closure. - Switch Tree.mu to sync.RWMutex; LongestMatch, Weight and Len take the read lock (RLock) while Insert, Evict and Remove keep the write lock. Confirmed race-clean under go test -race. - Document the strict greater-than TTL boundary on Options.TTL and expired: age exactly equal to TTL is still live. - Guard Insert against an empty key (no-op): the root never holds a value. Adds Ginkgo specs covering MaxEntries eviction, ancestor reclamation, the no-growth-past-cap invariant, the TTL boundary, and empty-key behavior for both Insert and LongestMatch. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): RoutePolicy enum with parse/resolve Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): Config with defaults and validation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): deterministic xxhash prefix-chain extractor Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): pure filter-then-score replica selection Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): Provider interface and radix-tree-backed Index Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * style(prefixcache): gofmt policy enum comment alignment Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): head-first prefix chunking and hoist Weight out of sort Address code-quality review findings in the prefixcache package. Correctness: ExtractChain now chunks from absolute offset 0 with fixed [0,W),[W,2W),... boundaries and caps the chain to the FIRST MaxDepth head blocks. The previous tail-keeping logic shifted the byte offset by a non-window amount once a conversation grew past MaxDepthWindowBytes, changing every hash each turn and silently breaking cross-turn longest-prefix matching. The reusable KV/prefix cache lives at the head of the prompt, so anchoring at offset 0 makes the chain a true prefix-chain: P and P+suffix share their full leading overlap. Add a regression spec proving cross-turn stability past the cap. Performance: Index.Decide precomputes each candidate's Weight once (decorate-sort-undecorate) instead of calling the O(tree size) Weight inside the O(n log n) sort comparator. Behavior is unchanged. Lint: encode prev with binary.LittleEndian.PutUint64 instead of a manual byte loop, clearing the modernize rangeint finding. Also add a concurrent Decide/Observe/Invalidate spec to exercise Index's documented concurrency safety under go test -race. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> feat(messaging): prefixcache observe/invalidate subjects and payloads Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): NATS sync publish/apply for observe and invalidate Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributedhdr): ctx carrier for prefix-hash chain Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributedhdr): PrefixChainHook indirection for backend-side chain build Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend): stash prompt prefix chain on ctx before distributed routing Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend): mirror modelID fallback for prefix-chain salt parity Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): scheduling config columns for prefix-cache routing Add RoutePolicy and per-model balance/prefix-match override columns to ModelSchedulingConfig and include them in the SetModelScheduling upsert DoUpdates list so updates are not dropped on conflict. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): optional route preference in FindAndLockNodeWithModel Add a RoutePreference type and a new pref parameter so the atomic pick+lock+increment can be biased toward a preferred node without weakening atomicity. A nil preference reproduces the previous ORDER BY behavior exactly. Update the ModelRouter interface, both router.go call sites (pass nil for now; Phase 5 builds the real preference), the test doubles, and the distributed e2e caller. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(prefixcache): make Sync satisfy Provider with Evict Sync.Observe now returns whether the local index treated the assignment as new or extended, and Sync gains an Evict method that delegates to the wrapped index. Together these let SmartRouter hold a single prefixcache.Provider that broadcasts via NATS. Adds a compile-time Provider assertion and an Evict-delegates behavioral test. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): prefix-cache-aware preference and observe in SmartRouter.Route Add a PrefixProvider + PrefixConfig to SmartRouterOptions/SmartRouter (nil keeps routing byte-for-byte the round-robin floor). On each request Route now calls buildPreference: it reads the prompt prefix chain from ctx (distributedhdr.PrefixChain), resolves the per-model policy/thresholds over the global config, loads candidate replica in-flight via a new registry read LoadedReplicaStats (deduped to one entry per node using the MIN in-flight across that node's replicas), asks the provider to Decide, and runs prefixcache.Select. The chosen node is passed as the RoutePreference to FindAndLockNodeWithModel on all three pick paths (cache hit, locked re-pick, cold scheduleAndLoad), and the served node is recorded via Observe only when the resolved policy is prefix_cache so round-robin models never pollute the tree. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): invalidate prefix-cache entries on unload and stale removal UnloadModel and both staleness fall-through paths in Route (after a failed gRPC probe and RemoveNodeModel) now call prefixProvider.Invalidate(model, nodeID), guarded by a nil-provider check so the round-robin floor is unchanged. At runtime the provider is the prefixcache.Sync, so invalidations also broadcast to peer frontends. Adds a test that a previously hot prefix no longer Decides to a node after UnloadModel. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> feat(prefixcache): rolling forced-disturb pressure counter Add a concurrency-safe per-model rolling counter that tracks how many times a request had a usable hot prefix match but the load guard forced it off the warm node. Entries outside the window are dropped lazily on Count so the backing slice stays bounded. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): autoscale on prefix-cache forced-disturb pressure Wire the rolling forced-disturb counter into the SmartRouter and the ReplicaReconciler. Router: in buildPreference, after Decide + Select, record a forced-disturb when a usable hot prefix match existed (d.HotNodeID != "" and d.MatchRatio >= cfg.MinPrefixMatch) but Select chose a different node (or nothing) because the load guard ruled the warm node out. This is the scale-worthy signal: the cache-warm replica is saturated. It deliberately does not fire for all-unique workloads (no hot match), avoiding false-positive scale-ups. Pressure is optional on SmartRouterOptions; nil keeps the path a no-op. Reconciler: read the same Pressure instance in reconcileModel as an extra scale-up reason, reusing the existing MaxReplicas + ClusterCapacityForModel guards and the UnsatisfiableUntil cooldown that gates the whole method. Pressure never overrides MaxReplicas and never force-evicts; a no-capacity model does not spin. Window and threshold come from prefixcache.Config (PressureWindow default 1m, PressureScaleThreshold default 1) and are configurable via ReplicaReconcilerOptions. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): bound Pressure slice in Record; drop dead reconciler pressureWindow Record now prunes entries older than the rolling window (the same prune Count does), via a shared pruneLocked helper, so a model that takes forced-disturb records but is never Counted (e.g. one with zero loaded replicas the reconciler skips) no longer grows its backing slice unbounded. Also removes the dead pressureWindow struct field and the ReplicaReconcilerOptions.PressureWindow option from the reconciler: they were stored but never read (the window lives inside the prefixcache.Pressure instance). The scale block now reads pressure.Count once into a local. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> feat(api): prefix-cache fields in scheduling endpoint DTO with validation Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ui): prefix-cache routing controls in node scheduling form Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): wire prefix-cache index, NATS sync, and config Activates prefix-cache-aware routing in distributed mode. Builds the prefixcache Index + NATS-backed Sync + Pressure counter, installs the distributedhdr.PrefixChainHook so core/backend/llm.go attaches a prefix chain per request, subscribes to prefixcache.observe/prefixcache.invalidate to apply peers' events to the local index (no re-broadcast), threads PrefixProvider/PrefixConfig/Pressure into the SmartRouter and Pressure/PressureThreshold into the ReplicaReconciler, and runs a background eviction ticker (every TTL/2) bound to the app context. Enabled by default; --distributed-prefix-cache=false (LOCALAI_DISTRIBUTED_PREFIX_CACHE) opts out and leaves the provider/pressure nil so routing stays round-robin. --distributed-prefix-cache-ttl (LOCALAI_DISTRIBUTED_PREFIX_CACHE_TTL, default 5m) controls entry idle-timeout and eviction cadence. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(nodes): round-robin-floor invariant for prefix-cache routing Drives Select directly: a saturated hot node (in_flight 50 vs 0) is never picked even with a perfect prefix match (round-robin floor holds), while a balanced hot node within the load slack is reused. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(prefixcache): clear branch lint findings and em dashes Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): validate prefix-cache config at startup wiring Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * perf(radixtree): single-walk WeightsFor for batch value weights Add Tree.WeightsFor(values, now) which computes the recency-weighted weight for many values in a single O(N + len(values)) tree traversal, versus calling Weight once per value (O(len(values) * N)). Consumers that score K candidates against the tree under the read lock no longer pay K full walks. Extract the per-entry contribution math into an unexported helper shared by both Weight and WeightsFor so the metric stays identical (DRY). Weight's public behavior is unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(config): add ModelConfig.ModelID() single source of truth The c.Name fallback to c.Model was duplicated in core/backend/options.go (feeding model.WithModelID) and hand-copied into core/backend/llm.go (the prefix-chain salt). These MUST agree or the prefix-cache salt diverges silently from the id the model loader tracks. Consolidate both into a new config.ModelConfig.ModelID() helper and call it from both sites. Behavior is identical. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * perf(prefixcache): reuse one xxhash.Digest in ExtractChain ExtractChain allocated a fresh xxhash.New() Digest per block (up to MaxDepth per call) and grew the chain slice without preallocation. Reuse a single Digest via Reset() before each block and preallocate the chain to min(nBlocks, MaxDepth). xxhash seed 0 is stateless, so Reset()+Write produces the byte-identical value to a fresh New()+Write. Output hashes are unchanged, preserving the cross-process determinism that peers rely on over NATS. Verified by capturing ExtractChain output for the existing test inputs before and after the refactor: identical. Existing extractor tests pass unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): drop hot match when matched node is not a candidate; weigh cold candidates in one walk Index.Decide called radixtree.LongestMatch over the whole tree, so the deepest match could be a node that is offline, unloaded, or simply not in the passed candidate set. Honoring that as HotNodeID produced a false forced-disturb signal upstream (buildPreference records pressure when chosen != HotNodeID), making it look like a warm replica was load saturated when it was actually absent. Build the candidate set once and only set HotNodeID/MatchRatio when the matched node is an actual candidate; otherwise fall back to cold placement. A future refinement could ask the tree for the longest match restricted to the candidate nodes (shallower-but-valid) instead of dropping it. Also replace the per-candidate tree.Weight call in the cold-order sort with a single tree.WeightsFor walk, turning O(KN) under the read lock into O(N + K). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> refactor(prefixcache): remove Select's unreachable deterministic fallback buildPreference always passes ColdOrder as a permutation of the full candidate set, so the cold-order loop hits every eligible candidate. The trailing best/bestIF scan was dead. Replace it with a plain "return """ and document that ColdOrder is guaranteed to cover all candidates, so "" means none were eligible. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(nodes): fetch model scheduling config once per Route GetModelScheduling was read three times per request - in resolveSelectorCandidates, buildPreference, and nodeMatchesScheduling - three DB round-trips for one row that is immutable for the life of the request, and not a consistent snapshot. Fetch it once near the top of Route and thread the ModelSchedulingConfig (may be nil) into all three helpers. scheduleNewModel keeps its own fetch since it runs outside the Route snapshot. Behavior is identical for nil sched. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> fix(autoscale): add Pressure.Reset to consume forced-disturb signal Pressure.Count is non-draining (it prunes only by age), so a single burst of forced-disturbs stays within the rolling window for the whole window and keeps Count >= threshold on every reconciler tick. The reconciler will use Reset to clear a model's events after acting on the signal so a fresh scale-up requires fresh forced-disturbs to accumulate, rather than one burst driving the model toward MaxReplicas. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(autoscale): at most one scale-up per reconcile tick, consume pressure Two autoscale bugs: 1. Over-scaling: the pressure scale-up block read Pressure.Count but never consumed it. With a non-draining counter a single forced-disturb burst kept Count >= threshold across the whole window, firing scaleUp on every tick and pushing the model toward MaxReplicas off one transient burst. After a successful pressure-triggered scale-up the reconciler now calls Pressure.Reset to consume the signal. 2. Double scale-up in one tick: the all-replicas-busy block and the pressure block could both fire in the same reconcileModel pass, each calling scaleUp(+1) against the same `current` read once at the top, so a model that was both busy and over threshold scaled +2 and could overshoot MaxReplicas by one. A scaledUp flag now enforces at most one scaleUp(+1) per tick: the pressure block is skipped if the busy block already scaled, and scale-down is skipped in any tick that scaled up. MinReplicas enforcement, UnsatisfiableUntil backoff, and capacity guards are unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): replica-removed chokepoint hook for prefix-cache invalidation Add SetReplicaRemovedHook to NodeRegistry and fire it from both RemoveNodeModel and RemoveAllNodeModelReplicas after a successful delete. This is the single chokepoint every replica-removal path funnels through (router eviction, reconciler scale-down, probe reaper, health-monitor node-down reap, RemoteUnloaderAdapter), so the prefix-cache index can be invalidated by construction rather than wiring each call site individually. The hook is stored in an atomic.Pointer so the startup wiring (setter) and the request/reconcile-time fire are race-free; it is nil-safe when unset. GORM Delete reports no error for a no-op delete, so the hook also fires when nothing was removed; the consumer's Invalidate(model, node) is idempotent so this is harmless. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): invalidate prefix-cache on any replica removal via registry hook Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(prefixcache): single source of truth for threshold bounds Extract ValidateThresholds into prefixcache/config.go so the per-model override validation (nodes.go endpoint) and Config.Validate share one implementation of the numeric bounds (min_prefix_match in [0,1], balance_abs_threshold >= 0, balance_rel_threshold == 0-or->= 1) instead of hard-coding them in two places. The route_policy allow-list stays explicit (not ParsePolicy, which maps typos to Default). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(nodes): preserve prefix-cache settings on partial scheduling update A scheduling POST that omitted route_policy/thresholds (e.g. a min_replicas-only update) full-replaced every column and silently reset the model's previously-configured prefix-cache settings to empty/zero. Make the four prefix-cache request fields pointers so omitted is distinguishable from explicit zero, and merge PATCH-style in SetSchedulingEndpoint: a provided pointer wins, an omitted one preserves the existing config value (zero default when none). Non-prefix fields keep their full-replace PUT semantics. Validation now runs on the resolved values via prefixcache.ValidateThresholds. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): make Invalidate a no-op for uncached models and skip empty broadcasts A registry chokepoint fires Sync.Invalidate(model, nodeID) for every replica removal of every model, including round-robin models that never used the prefix cache. Index.Invalidate previously called tree(model), which lazily created and permanently retained an empty radix tree for any model that ever lost a replica, growing the trees map without bound. Sync.Invalidate also published a NATS PrefixCacheInvalidateEvent on every call, amplifying no-op removals across the cluster. Index.Invalidate now looks the tree up read-only via existingTree and returns without allocating when none exists. The Provider interface is unchanged; Sync gates the broadcast through an optional invalidateExisting(bool) capability type-asserted from the wrapped Index, falling back to the prior always-broadcast behavior for other Provider implementations. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * perf(prefixcache): derive Decide candidacy from WeightsFor and skip trivial sort WeightsFor already returns a map keyed by every requested candidate, so the separate candidates set built to validate the hot match was redundant: a node is a candidate iff it is a key in the weights map. Drop the extra map and gate the hot-match check on weights membership. Also skip the sort when there is at most one candidate, since the input order is already the cold order. Behavior is unchanged. Deferred follow-up: skipping the WeightsFor walk entirely when a hot match wins would need lazy cross-file changes and is out of scope here. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(nodes): fire replica-removed hook on bulk node_models deletes; trim LoadedReplicaStats columns Bulk node-scoped node_models deletes (Register re-register cleanup, MarkOffline, MarkDraining, Deregister) removed rows directly without firing the replica-removed hook, so the prefix-cache index kept pointing at nodes whose models were gone. Capture the DISTINCT model names before each bulk delete and fire fireReplicaRemoved once per model after a successful delete, restoring the single-chokepoint invariant for all removal paths. The pre-query is skipped when no hook is set so the no-hook path stays cheap. Also narrow LoadedReplicaStats to SELECT only node_id and in_flight (the only fields the router consumer reads), dropping the JOIN-side available_vram fetch and unused columns while keeping the []ReplicaCandidate return type unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(reconciler): consume autoscale signals only on a real scale-up scaleUp was fire-and-forget (void) yet its callers unconditionally consumed the pressure signal (Pressure.Reset) and the MinReplicas hysteresis (ClearUnsatisfiable) right after calling it. If scaleUp added nothing (ScheduleAndLoadModel errored, or no node could be loaded) the saturated warm replica got no new replica AND its accumulated forced-disturb history was wiped, forcing the signal to re-accumulate over a full PressureWindow before the next attempt. Make scaleUp return whether at least one replica was actually scheduled, and gate the side effects on it: - pressure block (2b): set scaledUp and call Pressure.Reset only on success; on failure preserve the signal so the next tick retries off the same accumulated pressure. - busy-burst block (2): set scaledUp from the return value so a failed attempt does not suppress the pressure path or scale-down. - MinReplicas block: call ClearUnsatisfiable only on success so a failed attempt does not reset the unsatisfiable counter. All existing invariants (MaxReplicas, capacity gating, UnsatisfiableUntil cooldown, at-most-one-scale-up-per-tick) are preserved. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(nodes): drop router's redundant prefix-cache Invalidate calls The NodeRegistry removal chokepoint (RemoveNodeModel / RemoveAllNodeModelReplicas) now fires SetReplicaRemovedHook, which invalidates the prefix-cache index. The router was also calling prefixProvider.Invalidate explicitly right after each registry removal on the two stale-replica health-probe fall-throughs in Route and in UnloadModel, so every router-side eviction invalidated twice (double tree-prune + double NATS broadcast). Remove the three redundant explicit Invalidate calls and their empty nil-guards. Each removed call sat immediately after a registry removal that fires the hook, so invalidation is preserved via the chokepoint. Decide/Observe usage is untouched. Re-point the unit test (fake registry fires no hook) to assert the removal chokepoint is exercised on unload instead of the router's direct invalidation. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): broadcast invalidations unconditionally for cross-frontend coherence Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(prefixcache): reject TTL<=0 in Config.Validate (eviction ticker would panic) Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(nodes): make capture+delete atomic in bulk node_models removal paths MarkOffline, MarkDraining, and the Register re-register cleanup ran the nodeModelNames SELECT and the bulk node_models DELETE as two separate statements on r.db with no transaction. A SetNodeModel landing between the two was deleted but its replica-removed hook never fired, leaving the prefix-cache index pointing at a removed replica until TTL or candidacy self-heal. Wrap the capture and the delete in a single db.Transaction in each path (mirroring how Deregister already does it). The captured model names are collected into a slice declared outside the closure; the replica-removed hook fires for each only after the transaction commits, so a rollback never invalidates the index for a removal that did not persist. The set of fired hooks now equals exactly the set of node_models rows actually deleted, with no interleaving gap. The status flip in MarkOffline/MarkDraining (setStatus) is a separate, pre-existing operation and routing already filters non-healthy nodes, so it stays outside the transaction; return contracts are unchanged. Deregister was already correct and is untouched. The cheap-path skip (no hook -> skip the SELECT) is preserved. Adds a spec asserting MarkOffline fires hooks for exactly the rows it deletes and leaves no node_models row behind (consistent snapshot). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(nodes): debug logging for prefix-cache routing decisions and observations Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(radixtree): match shared prefixes by valuing every node on insert Insert recorded the value (node id) only on the final node of the key chain, leaving every intermediate prefix node valueless. LongestMatch returns the deepest node that hasValue, so two chains that share a leading block but diverge in the tail never matched: only exact-repeat queries hit. That broke the prefix-cache routing core use cases (shared system prompt, multi-turn extension, volatile tail), all of which rely on prefix matching rather than exact-repeat. Set value/hasValue/lastSeen at every node along the chain so each prefix-block node remembers the node id that served that prefix (SGLang/vLLM-style). The deepest match wins, and the last writer owns a shared prefix node (a recency heuristic: the most recent chain through a block is the one most likely still warm). size now counts valued nodes, which is the intended meaning. Updated radixtree tests to the new semantics: deepest-prefix test uses non-overlapping chains, a new test asserts last-writer-owns-shared-node, Evict/Remove/MaxEntries expectations recomputed for per-prefix-node counting, and a shared-prefix LongestMatch red test added. Added a prefixcache Decide test proving a prefix-only query routes to the warm node. No prefixcache .go logic changed. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(distributed): lock in prefix-cache routing behavior end to end Add a DB-backed e2e spec that drives SmartRouter against a real NodeRegistry (Postgres testcontainer) and the real prefixcache.Index radix-tree provider, using a fake gRPC backend factory so no real inference runs. Covers the five behaviors validated by hand: 1. Cold miss + observe: an unseen prefix chain cold-places and is recorded. 2. Hot-match affinity: the same chain returns to its warm node X. 3. Shared-prefix match: a divergent chain sharing X's leading prefix still routes to X (the radix-tree regression we fixed). 4. Negative control: an unrelated chain is a cold miss, not a false hot match on X. 5. Failover + invalidation: removing X's replica fires the registry chokepoint hook to invalidate the prefix entry, and the chain fails over to surviving node Y and re-homes there. Replaces the need for manual docker-compose re-runs. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(prefixcache): make prefix-cache affinity replica-granular Track prefix-cache affinity per loaded replica (a backend process with its own KV cache) instead of per node, so multiple replicas of the same model on one node each keep distinct affinity and a hot prefix routes back to the exact replica that served it. - radixtree: add RemoveFunc(pred) and reimplement Remove on top of it. - prefixcache: introduce ReplicaKey{NodeID, Replica}; Index/Candidate/ PrefixDecision/Select/Provider now key on ReplicaKey. Add InvalidateNode to drop every replica of a node; Invalidate drops one replica. Select returns (ReplicaKey, bool) and gains a deterministic least-in-flight eligible fallback (tiebreak NodeID then Replica). - messaging: carry Replica on PrefixCacheObserveEvent and PrefixCacheInvalidateEvent (Replica < 0 means all replicas of the node). - Sync delegates + broadcasts with replica; InvalidateNode broadcasts Replica=-1; ApplyInvalidate routes negative replica to InvalidateNode. This is part 1 of 2; the registry/router/wiring consumers are updated separately. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): make prefix-cache routing replica-granular Wire the SmartRouter, NodeRegistry, and distributed startup to the replica-keyed prefixcache API. Affinity is now tracked per replica (each replica is a separate process with its own KV cache), so a prefix served by (node,0) no longer leaks onto the same-node sibling (node,1). - RoutePreference gains PreferredReplica; FindAndLockNodeWithModel locks the EXACT (node_id, replica_index) row, falling through to the default ORDER BY when that replica is not loaded. - SetReplicaRemovedHook now carries replicaIndex; RemoveNodeModel fires the specific replica, RemoveAllNodeModelReplicas and the four bulk node-scoped deletes fire replica<0 (all replicas of the node). - buildPreference builds one Candidate per loaded replica and locks the exact replica the policy chose; observePrefix records the served ReplicaKey at every call site. - distributed.go routes the hook to InvalidateNode (replica<0) or Invalidate(key). - Tests updated to the replica-keyed API plus new coverage: a hot prefix on (node,0) prefers replica 0 over the same-node sibling (router unit + e2e), and FindAndLock locks the exact preferred replica. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(distributed): derive prefix chain from messages for tokenizer-template models Prefix-cache-aware routing built its prompt-prefix chain from the rendered prompt string `s` in ModelInference. For models with TemplateConfig.UseTokenizerTemplate the frontend never renders a prompt - the backend tokenizes the structured messages itself - so `s` is empty, the chain is empty, and routing silently falls back to round-robin. That covers the bulk of modern chat models (qwen3, llama3, ...), so the feature effectively never engaged for them. Fall back to messagesPrefixSource(messages): a deterministic, prefix-stable head-first serialization of the conversation (role + content per turn). Two requests sharing a leading system prompt and early turns share a leading byte prefix, which ExtractChain maps to a shared chain prefix - landing both on the same cache-warm replica. The rendered `s` is still preferred when present (higher fidelity for non-template models). Found via the multi-replica-per-node e2e: zero "prefix-cache routing decision" logs despite per-request Route calls, traced to the empty-chain guard. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): document prefix-cache routing roadmap Add a routing-and-caching roadmap section to the distributed-mode guide, linking the epic (#10063) and the follow-up issues (#10064-#10070) surfaced from a survey of SGLang, vLLM production-stack, Ray Serve, llm-d, AIBrix, and NVIDIA Dynamo. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-30 23:24:22 +02:00
Richard Palethorpe	12d1f3a697	security(http): refuse redirects on outbound clients via hardened pkg/httpclient (#10087 ) LocalAI's outbound HTTP clients used Go's default redirect policy, which follows up to 10 redirects. On a cross-host redirect Go forwards custom request headers — including credential headers such as Anthropic's x-api-key — to the redirect target (Go strips Authorization, Cookie and WWW-Authenticate cross-host, but NOT arbitrary custom headers). An attacker able to elicit a redirect from an upstream (a hijacked or spoofed upstream, DNS trickery, or a malicious upstream_url) then harvests the operator's provider API key. This was first reported against the cloud-proxy / MITM PII path (GHSA-3mj3-57v2-4636); the same class affects every other outbound client. Rather than patch each call site, add pkg/httpclient as the one sanctioned constructor for outbound HTTP and route everything through it. pkg/httpclient: - New(...) refuses redirects, TLS 1.2 floor, no body deadline (streaming/SSE safe) - NewWithTimeout(d) simple request/response calls - WithFollowRedirects opt-in following that still strips credential headers on any cross-host hop; different scheme/host/port == different origin, guarding the curl CVE-2022-27774 port-confusion class - WithTransport(rt) keep a custom transport (IP-pin, HTTP/2, a credential-injecting RoundTripper) - HardenedTransport() base transport with the TLS floor + bounded setup - Harden(c) apply the policy to a library-supplied http.Client - NoRedirect the CheckRedirect policy; wraps ErrRedirectBlocked Lint: a forbidigo rule flags http.DefaultClient and http.Get/Post/ PostForm/Head, pointing at pkg/httpclient (.golangci.yml, .agents/coding-style.md). forbidigo cannot match the &http.Client{} composite literal without also flagging legitimate http.Client type references, so that form is enforced by review. Migrates every non-test outbound call site across core/, pkg/, cmd/, and the Go backend (backend/go/cloud-proxy). Credential-bearing and internal-RPC clients refuse redirects; download / CDN / registry clients use WithFollowRedirects so they keep working while stripping secrets cross-host. The only credential-bearing client that follows redirects is the gated-download path (pkg/downloader/uri.go), which strips the token on the cross-host hop to the CDN. Hardening this closes, in passing: - MCP remote-server bearer token leaking via a redirect (the RoundTripper re-injected Authorization on every hop) - agent multimedia/webhook clients leaking user-supplied auth headers - cors_proxy following redirects, bypassing its SSRF IP-pin - downloader's authorized read path leaking the token cross-host Fixes: GHSA-3mj3-57v2-4636 (cloud-proxy leaks operator provider API key (x-api-key) to attacker host on cross-host redirect) Reported-by: tonghuaroot Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-30 12:04:10 +02:00
泊舟	e1a782b70f	fix(openai): stop streaming tool-call double-emission when autoparser is active (#10055 ) Streaming /v1/chat/completions could emit the same logical tool call at multiple `index` values. In processStreamWithTools the Go-side iterative parser (ParseXMLIterative / ParseJSONIterative) runs on every token and emits tool-call deltas, while the C++ chat-template autoparser delivers its own tool calls via ChatDeltas that are flushed at end-of-stream by ToolCallsFromChatDeltas -> buildDeferredToolCallChunks. With both paths active the same call is emitted twice at different indices, so OpenAI clients that accumulate tool calls by `index` dispatch the tool N times. Skip the Go-side iterative parser once the autoparser is producing tool calls (hasChatDeltaToolCalls). The deferred flush stays guarded by lastEmittedCount, so the race where the Go parser emitted before the flag flipped also remains single-emission. Backends without an autoparser (e.g. vLLM) keep hasChatDeltaToolCalls=false and are unaffected. Refs #9722 Signed-off-by: bozhouDev <259759010+bozhouDev@users.noreply.github.com> Co-authored-by: bozhouDev <259759010+bozhouDev@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-29 11:39:09 +02:00
Tai An	0fd666ee6e	fix(openresponses): populate Content and accept bare {role,content} items (#10039 ) (#10040 ) * fix(openresponses): populate Content and accept bare {role,content} items (#10039) Fixes mudler/LocalAI#10039 — `/v1/responses` silently returned empty output on any model whose YAML doesn't include a Go-side `template.chat_message` block. Three cooperating bugs: * `convertORInputToMessages` populated only `StringContent` for string input and for the `input.Instructions` system message, leaving the `Content` (any) field nil. * `TemplateMessages` gated all fallback content-rendering branches on `Content != nil && StringContent != ""` — but every branch in that function consumes `StringContent`, not `Content`. The `&&` silently dropped messages that had StringContent set and Content nil, producing an empty prompt that the 5× empty-retry guard then turned into a 200 OK with `output: []`. * The array-input branch of `convertORInputToMessages` dispatched on `itemMap["type"]` with no default, dropping bare `{role, content}` items emitted by the OpenAI Python SDK helper `client.responses.create(input=[{...}])`. Fix: * Set both `Content` and `StringContent` in the two openresponses message-construction sites that only set one. * Treat a bare `{role, content}` item (no `type`) as `type: "message"` for OpenAI-SDK compatibility. * Gate `TemplateMessages` fallback rendering on `StringContent != ""`, which is what every downstream branch in that function actually reads. Regression test added to `evaluator_test.go` covering the fallback path (no `ChatMessage` template) with a StringContent-only message, both with and without a role mapping. * test(openresponses): guard Content population and ToProto path (#10039) Add regression tests for the two seams the original fix touched but left uncovered: * convertORInputToMessages must populate both Content and StringContent for plain string input and for bare {role, content} array items (the OpenAI SDK shape that omits the type discriminator). Both are functional reds against the pre-fix code. * Messages.ToProto reads Content, not StringContent — this is the path UseTokenizerTemplate backends (imported GGUFs) take. The cases pin that contract so a future regression on the producer side is caught. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-28 07:21:48 +00:00
Richard Palethorpe	8d70855ea6	test: add Go + React UI coverage gates and fill test gaps (#9989 ) - Strict monotonic Go coverage gate (make test-coverage-check, 45% baseline) run in CI; fixes ginkgo dropping all-but-one coverprofile across multiple recursive roots, builds with -tags auth, and folds in the in-process tests/e2e suite via --coverpkg. - React UI e2e coverage (make test-ui-coverage: vite-plugin-istanbul + nyc, nix-provided Chromium) plus e2e specs for 6 previously-untested pages, and a UI coverage gate (make test-ui-coverage-check) with a small tolerance since e2e line coverage jitters ~0.5pp run-to-run. - pre-commit hook: lint + coverage on Go changes, Playwright e2e + UI coverage gate on react-ui changes; install with make install-hooks. - New Go handler tests (settings, branding), hermetic base64 download test. - fix(ui): model editor reads vram_display (snake_case), so the VRAM estimate renders again; covered by a regression test. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-26 22:06:10 +02:00
LocalAI [bot]	e4c70fca7a	fix(streaming/tools): don't leak prefill-misclassified content as trailing reasoning chunk (#10000 ) When the C++ autoparser is in pure-content fallback mode (qwen3-4b after model emits a tool-call JSON in non-thinking mode, the streaming worker ended the SSE stream with a spurious data: {...,"delta":{"reasoning":"{\"name\":\"exec\",\"arguments\":...}"}} chunk carrying the same JSON that was already in delta.tool_calls. The Go-side ReasoningExtractor is configured from DetectThinkingStartToken, which scans the model's jinja chat template verbatim and finds <think> inside an {% if enable_thinking %} block without evaluating the conditional. Every output chunk then runs through PrependThinkingTokenIfNeeded, which synthesizes a <think> in front and makes ExtractReasoning treat everything after as reasoning. The autoparser correctly classifies zero reasoning (qwen3's tool format isn't on llama.cpp's recognized-tool list, so all tokens land in ChatDelta.Content), but processStreamWithTools then preferred extractor.Reasoning() over functions.ReasoningFromChatDeltas at the end-of-stream flush — handing the polluted Go-side state to buildDeferredToolCallChunks, which emitted it as a trailing reasoning chunk. Two changes: * Add a sticky preferAutoparser flag to processStreamWithTools, mirroring the analogous flag in processStream from #9985. Once any ChatDelta carries content or reasoning, the flag stays on for the rest of the stream and the worker stops falling back to the Go-side extractor for per-token deltas. This avoids the per-chunk leak path and the cumulative pollution. * Extract chooseDeferredReasoning, a small helper that selects the end-of-stream reasoning source. When preferAutoparser is set, return functions.ReasoningFromChatDeltas(chatDeltas); otherwise fall back to extractor.Reasoning() (the correct source for vLLM and other backends with no autoparser). The helper has a focused test suite covering both sides of the contract: autoparser-active with empty reasoning (the qwen3 case — the fix's purpose), autoparser-active with real reasoning_content (jinja-with-recognized-format models), and autoparser-not-active with genuine Go-side reasoning (vLLM-style backends). E2E with combined #9988 and this fix on qwen3-4b post-#9985 gallery shape: 18 content chunks of the tool-call JSON, 1 tool_call chunk with name='exec' and the right arguments, finish_reason=tool_calls, and zero reasoning chunks — down from one polluted reasoning chunk before this fix. Depends on #9999 (the streaming JSON tool-call gating bug for qwen3) to make the trailing chunk observable end-to-end; the helper unit tests are independent. Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-26 08:34:26 +02:00
LocalAI [bot]	f17d99f6e5	fix(streaming/tools): stop healing-marker stubs from gating off content (#9999 ) * fix(streaming/tools): stop healing-marker stubs from gating off content When the C++ autoparser is in pure-content fallback mode (e.g. qwen3 without --jinja) and the model emits a tool call as JSON, the streaming worker calls ParseJSONIterative on each new chunk. parseJSONWithStack heals partial input like `{` into `{"<marker>":1}` where <marker> is a random integer. removeHealingMarkerFromJSON only stripped the marker from values, so the synthetic key survived and downstream callers saw a stub object with a random-looking key. chat_stream_workers.go's JSON tool-call detector then bumped lastEmittedCount past the stub even though no real tool call was emitted, gating off ALL subsequent content chunks. The qwen3 + tools + streaming case ended up dribbling only the first `{"` to clients and then nothing, even when the model went on to call the noAction `answer({"message": "…"})` pseudo-tool. Three changes, each with its own regression test: * removeHealingMarkerFromJSON now strips the marker suffix from keys too, dropping the entry when the truncated key is empty. Inputs like `{` no longer leak `{"<marker>":1}` to callers; partial keys like `{ "code` still preserve the model-typed prefix `code`. * ParseJSONIterative skips empty-after-healing maps so a healed `{` doesn't surface as a stub result. * The streaming JSON detector now breaks (not continues) on entries without a usable `name`, and only bumps lastEmittedCount past successfully-emitted entries. Defense-in-depth against any future partial-parse shape. The parser tests cover eight partial-JSON-prefix shapes and verify no marker characters leak into keys, plus the two early shapes (`{`, `{"`) that should not surface a stub at all. Fixes #9988 Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(streaming/tools): cover the autoparser-correctly-working path Extract the JSON tool-call streaming emit loop into emitJSONToolCallDeltas and unit-test it against every shape that can hit the streaming worker: * the bug case — a healing-marker stub at index 0 must NOT bump lastEmittedCount, so subsequent content chunks keep flowing; * the autoparser-correctly-working case — empty jsonResults (because the C++ autoparser cleared the raw text and delivers tool calls via TokenUsage.ChatDeltas) is a no-op, leaving the deferred end-of-stream emitter to ship the autoparser's tool calls; * a single complete tool call — emit one chunk, advance to 1; * arguments arriving as a JSON-string vs as a nested object — both serialize to the wire as JSON-string arguments; * multiple parallel tool calls — one chunk each; * a real tool call followed by a partial stub — emit the real one, stop at the stub, resume on a later chunk once the stub completes. Locks down the no-regression guarantee the user asked for: this PR's fix is scoped to the pure-content fallback path; when the autoparser actually classifies tool calls (jinja-recognized chat format with tool support), the helper is a no-op and nothing changes. Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-25 23:55:35 +02:00
LocalAI [bot]	1c6c3adad6	fix(reasoning): stop <think> leaking into content when autoparser is in pure-content mode (#9991 ) When LocalAI templates a thinking model outside of jinja (the default for the qwen3 gallery family), llama.cpp's chat parser falls back to a "pure content" PEG parser that dumps the entire raw response into ChatDelta.Content with an empty ReasoningContent. The Go side then trusted that content verbatim and overrode tokenCallback's correctly-split reasoning, so <think>...</think> blocks ended up in the OpenAI `content` field. Regression from v4.0.0 introduced when the autoparser ChatDeltas path was added (#9224). The override now runs Go-side reasoning extraction defensively when the autoparser delivered content but no reasoning. The streaming worker gains a sticky preferAutoparser flag that flips on the first chunk where the autoparser classified reasoning_content; until then we use the streaming Go-side extractor. Realtime mirrors the non-streaming fallback. When the autoparser already populated ReasoningContent we trust it untouched, so jinja-enabled installs are not regressed. gallery/qwen3.yaml now enables use_jinja, letting the autoparser classify <think> natively for all 20+ qwen3 family entries that share this template. Fixes #9985 Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-25 22:39:50 +02:00
LocalAI [bot]	06e777b75e	feat(distributed): gated X-LocalAI-Node response header (middleware + wrapper) (#9976 ) * feat(distributed): add per-request node ID context holder Introduce pkg/distributedhdr, a leaf package carrying a per-request atomic.Value holder for the picked worker node ID from the SmartRouter (core/services/nodes) up to the HTTP response writer wrapper (core/http/middleware). Avoids the import cycle that a shared key in either consumer would create. Exposes NewHolder, WithHolder, Holder, Stamp, Load, Inherit. The holder is atomic.Value so cross-goroutine publish from the router to the response writer wrapper is race-clean. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> feat(distributed): add ExposeNodeHeader middleware + response writer wrapper New ApplicationConfig.ExposeNodeHeader bool + --expose-node-header CLI flag / LOCALAI_EXPOSE_NODE_HEADER env var (default off; the node ID reveals internal topology and is opt-in). The middleware creates a per-request atomic.Value holder, attaches it to c.Request().Context() via distributedhdr.WithHolder, and wraps c.Response().Writer with a custom http.ResponseWriter that sets the X-LocalAI-Node header on first Write / WriteHeader / Flush by reading the holder. Implements http.Flusher, http.Hijacker, Unwrap so it composes cleanly with Echo and http.NewResponseController. request.go propagates the holder onto derived contexts via distributedhdr.Inherit so the holder survives the correlation-ID context replacement. Unit + race-clean concurrency + integration specs. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> feat(distributed): stamp node ID in router and wire middleware to inference routes ModelRouterAdapter.Route stamps the picked node ID into the per-request holder via distributedhdr.Stamp(ctx, result.Node.ID) right after replica selection. Wire ExposeNodeHeader middleware to: - OpenAI chat/completion/embeddings + audio transcriptions/speech + image generations/inpainting - Anthropic /v1/messages - Ollama /api/chat, /api/generate, /api/embed, /api/embeddings - Jina /v1/rerank - LocalAI /v1/vad The middleware's wrapper reads the holder on first byte and sets the X-LocalAI-Node response header before delegating to the underlying writer. Per-request scope means no race under concurrent multi-replica routing. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(distributed): thread request context through backend Load + cover ctx propagation Five non-OpenAI backend helpers were silently using app.Context instead of the request context for the gRPC backend call: transcription, TTS, image generation, rerank, VAD. Effect: distributedhdr.Stamp in the router callback was a silent no-op for these paths, AND client cancellation didn't propagate to in-flight inference. Thread c.Request().Context() (or the equivalent input.Context after the request middleware has installed the correlation-ID derived context) through each helper and into ModelOptions via model.WithContext(ctx). ImageGeneration's signature gains a leading ctx parameter; in-tree callers (openai image, openai inpainting, openai inpainting_test) are updated to match. ModelEmbedding gains a leading ctx parameter for the same reason; the openai and ollama embedding handlers pass the request context through. chat_stream_workers.go defers the initial role=assistant chunk emission until the first token callback so the wrapper's lazy X-LocalAI-Node lookup against the loader runs AFTER ml.Load has stamped the per-modelID node ID; semantically identical for clients (role still arrives before any text). Regression test core/backend/ctx_propagation_test.go pins ctx propagation for all five helpers. Docs updated to enumerate the full endpoint coverage of the --expose-node-header flag. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-25 10:51:48 +02:00
Richard Palethorpe	6a80e23733	feat(middleware): Model routing, PII filtering, Cloud model proxies (#9802 ) Add a routing middleware stack and a cloud-proxy backend. * cloud-proxy: a Go gRPC backend that forwards OpenAI- and Anthropic-shaped chat requests to upstream providers, with an optional translate mode (OpenAI request -> Anthropic /v1/messages -> OpenAI response) and full tool-calling support. * routing: admission control, content-aware model routing (embedding cache + classifier + rerank + Arch-Router score), PII detection/redaction (regex + NER) with streaming filter and OpenAI/Anthropic adapters, and a per-user/per-key billing recorder backed by GORM or in-memory storage. * middleware: UsageMiddleware records usage via the billing recorder, plus admission, route-model, usage-stamp and trace middlewares. * observability: BackendTrace ring buffer stores full request bodies (capped), MITM proxy emits structured trace events, and router classifier decisions surface at /api/router/decide. * gallery: Arch-Router-1.5B (Q4_K_M and Q8_0). * UI: cloud-proxy model-editor fields, classifier system-prompt and score-normalization config, and a Traces page rendering request bodies. Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash] Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-25 09:28:27 +02:00
LocalAI [bot]	0b2ae3c6ca	fix(openai): stream usage non-zero when tools are enabled (#9941 ) * chore: ignore local .worktrees directory Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(openai): stream usage non-zero when tools are enabled The streaming chat-completions worker for tool-bearing requests (processTools in core/http/endpoints/openai/chat.go) never forwarded the cumulative TokenUsage from ComputeChoices to the chunks it placed on the responses channel. The outer streaming loop's running usage tracker therefore stayed at the zero value, and the include_usage trailer reported {prompt_tokens:0, completion_tokens:0, total_tokens:0} whenever the request carried a `tools` array. Without tools, the alternative `process` path stamps Usage on every chunk, so that path was unaffected. Forward the final TokenUsage via a usage-only sentinel chunk (empty Choices, populated Usage) emitted right before close(responses). The outer loop's per-chunk Usage capture moves above the empty-Choices skip so the sentinel updates the tracker without ever reaching the wire, keeping the existing OpenAI spec contract (intermediate chunks carry no `usage` field, and the deferred-final-chunk helpers remain Usage-free per the regression test for issue #8546). Adds streamUsageFromTokenUsage, usageSentinelChunk, and applyChunkToUsage helpers with focused Ginkgo coverage plus a flow-level test that mirrors the outer-loop sequence. Fixes #9927 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4-7 [Claude Code] * refactor(openai): return final TokenUsage from stream workers Replace the usage-only sentinel SSE chunk introduced in the previous commit with a plain return value. The streaming workers process and processTools (now extracted as package-level processStream and processStreamWithTools) return (backend.TokenUsage, error); the outer ChatEndpoint loop reads the cumulative counts off the existing `ended` channel (now carrying streamWorkerResult{usage, err}) and builds the include_usage trailer from a normal Go value after the LOOP exits. This drops the empty-Choices "skip but capture Usage" rule from the outer loop and removes the usageSentinelChunk / applyChunkToUsage helpers entirely. The SSE responses channel is back to a single purpose: wire chunks only. processStream and processStreamWithTools move into chat_stream_workers.go so they can be exercised directly from tests. The chat_stream_usage_test.go suite now drives the workers with a mocked backend.ModelInferenceFunc and asserts on the returned TokenUsage. The regression coverage for issue #9927 is therefore behavioral: reverting the fix (discarding ComputeChoices' usage return) makes the assertions fail with concrete count mismatches. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4-7 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-22 10:13:41 +02:00
LocalAI [bot]	a39e025d64	fix(nodes): make per-node backend install async via gallery job queue (#9928 ) * feat(galleryop): add TargetNodeID to ManagementOp for single-node installs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(galleryop): add NodeScopedKey helpers for per-node opcache rows Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(galleryop): use strings.Cut for NodeScopedKey parsing, reject empty nodeID Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): scope DistributedBackendManager.InstallBackend to single node via TargetNodeID Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(http): make /api/nodes/:id/backends/install async via gallery service job queue The handler previously called unloader.InstallBackend synchronously and blocked the browser for up to 3 minutes waiting on the NATS reply. It now enqueues a TargetNodeID-scoped ManagementOp on BackendGalleryChannel and returns HTTP 202 + jobID immediately, matching /api/backends/install/:id. The opcache key is built via NodeScopedKey(nodeID, backend) so concurrent installs of the same backend across different nodes do not stomp each other. galleryService/opcache/appConfig are threaded through RegisterNodeAdminRoutes for this. Assisted-by: Claude:opus-4-7 [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(http): log malformed backend_galleries override and stop test drain goroutine Assisted-by: Claude:opus-4-7 [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(api): expose nodeID for node-scoped backend ops in /api/operations Node-scoped backend installs land in opcache under "node:<nodeID>:<backend>" keys. Without splitting that prefix back out, the operations panel renders the full key as the display name and has no structured way to label which worker an install is targeting. Detect the prefix, surface nodeID as its own response field, and reduce the display name back to the bare backend slug. Bare (non-scoped) ops are left untouched so legacy installs do not gain a misleading empty nodeID. Assisted-by: Claude:opus-4-7 [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(react-ui): poll job status for node-targeted backend installs Assisted-by: Claude:opus-4-7 [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(react-ui): make NodeInstallPicker state updates pure and surface cancellations as errors Assisted-by: Claude:opus-4-7 [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(react-ui): clarify async semantics in handleInstallOnTarget Assisted-by: Claude:opus-4-7 [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(http): use statusUrl casing for node install response to match codebase precedent Assisted-by: Claude:opus-4-7 [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-21 22:25:53 +02:00
LocalAI [bot]	a39591f144	realtime: honor output_modalities to skip TTS in text-only mode (#9838 ) * realtime: honor output_modalities to skip TTS in text-only mode The emulated realtime pipeline previously ignored the OpenAI Realtime spec field output_modalities and always synthesized TTS. Add resolveOutputModalities + modalitiesContainAudio helpers and gate the TTS / ResponseOutputAudio* emission so a client requesting ["text"] gets only ResponseOutputText* events. This lets thin clients (e.g. thing5-poc) cache TTS on the client side while still using the realtime WS for VAD + STT + LLM + tool-call parsing. Assisted-by: Claude:claude-opus-4-7 * realtime: plumb response-level output_modalities and echo on session Follow-up to the previous commit: - Resolve response.create's output_modalities at the gate so a per-response override of an audio session is honored (the test asserted this contract but the production call site was passing nil). - Mirror OutputModalities in the RealtimeSession echo so session.update round-trips the client-supplied value, matching MaxOutputTokens's pattern. Assisted-by: Claude:claude-opus-4-7 * realtime: silence errcheck on deferred os.Remove of TTS file CI's errcheck flagged the pre-existing `defer os.Remove(audioFilePath)` inside the audio-emission block (now wrapped by the modality gate). Wrap the call in a closure that explicitly discards the error — the canonical Go pattern for "I want to defer a cleanup whose error I genuinely don't care about." Assisted-by: Claude:claude-opus-4-7 golangci-lint --------- Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-15 12:39:47 +02:00
massy_o	745473cbe6	Validate video image URLs before download (#9819 ) Signed-off-by: massy-o <telitos000@gmail.com>	2026-05-14 15:07:17 +02:00
LocalAI [bot]	8af963bdd9	fix(streaming): comply with OpenAI usage / stream_options spec (#9815 ) * fix(streaming): comply with OpenAI usage / stream_options spec (#8546) LocalAI emitted `"usage":{"prompt_tokens":0,...}` on every streamed chunk because `OpenAIResponse.Usage` was a value type without `omitempty`. The official OpenAI Node SDK and its consumers (continuedev/continue, Kilo Code, Roo Code, Zed, IntelliJ Continue) filter on a truthy `result.usage` to detect the trailing usage chunk; LocalAI's zero-but-non-null usage on every intermediate chunk made that filter swallow every content chunk and surface an empty chat response while the server log looked successful. Changes: - `core/schema/openai.go`: `Usage OpenAIUsage \`json:"usage,omitempty"\`` so intermediate chunks no longer carry a `usage` key. Add `OpenAIRequest.StreamOptions` with `include_usage` to mirror OpenAI's request field. - `core/http/endpoints/openai/chat.go` and `completion.go`: keep using the `Usage` struct field as an in-process channel for the running cumulative, but strip it before JSON marshalling. When the request set `stream_options.include_usage: true`, emit a dedicated trailing chunk with `"choices": []` and the populated usage (matching the OpenAI spec and llama.cpp's server behavior). - `chat_emit.go`: new `streamUsageTrailerJSON` helper; drop the `usage` parameter from `buildNoActionFinalChunks` since chunks no longer carry usage. - Update `image.go`, `inpainting.go`, `edit.go` to wrap their Usage values with `&` for the new pointer field. - UI: send `stream_options:{include_usage:true}` from the React (`useChat.js`) and legacy (`static/chat.js`) chat clients so the token-count badge keeps populating now that the server is spec-compliant. Tests: - New `chat_stream_usage_test.go` pins the spec invariants: intermediate chunks have no `usage` key, the trailer JSON has `"choices":[]` and a populated `usage`, and `OpenAIRequest` parses `stream_options.include_usage`. - Update `chat_emit_test.go` to reflect that finals no longer embed usage. Verified against the live LocalAI instance: before the fix Continue's filter logic swallowed 16/16 token chunks; with the new shape it yields 4/5 and routes usage through the dedicated trailer chunk. Fixes #8546 Assisted-by: Claude:opus-4.7 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> fix(streaming): silence errcheck on usage trailer Fprintf The new spec-compliant `stream_options.include_usage` trailer writes were flagged by errcheck since they're new code (golangci-lint runs new-from-merge-base on master); the surrounding `fmt.Fprintf` data: writes are grandfathered. Drop the return values explicitly to match the linter's contract without adding a nolint shim. Assisted-by: Claude:opus-4.7 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-14 08:53:46 +02:00
Richard Palethorpe	0245b33eab	feat(realtime): Add Liquid Audio s2s model and assistant mode on talk page (#9801 ) * feat(liquid-audio): add LFM2.5-Audio any-to-any backend + realtime_audio usecase Wires LiquidAI's LFM2.5-Audio-1.5B as a self-contained Realtime API model: single engine handles VAD, transcription, LLM, and TTS in one bidirectional stream — drop-in alternative to a VAD+STT+LLM+TTS pipeline. Backend - backend/python/liquid-audio/ — new Python gRPC backend wrapping the `liquid-audio` package. Modes: chat / asr / tts / s2s, voice presets, Load/Predict/PredictStream/AudioTranscription/TTS/VAD/AudioToAudioStream/ Free and StartFineTune/FineTuneProgress/StopFineTune. Runtime monkey-patch on `liquid_audio.utils.snapshot_download` so absolute local paths from LocalAI's gallery resolve without a HF round-trip. soundfile in place of torchaudio.load/save (torchcodec drags NVIDIA NPP we don't bundle). - backend/backend.proto + pkg/grpc/{backend,client,server,base,embed, interface}.go — new AudioToAudioStream RPC mirroring AudioTransformStream (config/frame/control oneof in; typed event+pcm+meta out). - core/services/nodes/{health_mock,inflight}_test.go — add stubs for the new RPC to the test fakes. Config + capabilities - core/config/backend_capabilities.go — UsecaseRealtimeAudio, MethodAudio ToAudioStream, UsecaseInfoMap entry, liquid-audio BackendCapability row. - core/config/model_config.go — FLAG_REALTIME_AUDIO bitmask, ModalityGroups membership in both speech-input and audio-output groups so a lone flag still reads as multimodal, GetAllModelConfigUsecases entry, GuessUsecases branch. Realtime endpoint - core/http/endpoints/openai/realtime.go — extract prepareRealtimeConfig() so the gate is unit-testable; accept realtime_audio models and self-fill empty pipeline slots with the model's own name (user-pinned slots win). - core/http/endpoints/openai/realtime_gate_test.go — six specs covering nil cfg, empty pipeline, legacy pipeline, self-contained realtime_audio, user-pinned VAD slot, and partial legacy pipeline. UI + endpoints - core/http/routes/ui.go — /api/pipeline-models accepts either a legacy VAD+STT+LLM+TTS pipeline or a realtime_audio model; surfaces a self_contained flag so the Talk page can collapse the four cards. - core/http/routes/ui_api.go — realtime_audio in usecaseFilters. - core/http/routes/ui_pipeline_models_test.go — covers both code paths. - core/http/react-ui/src/pages/Talk.jsx — self-contained badge instead of the four-slot grid; rename Edit Pipeline → Edit Model Config; less pipeline-specific wording. - core/http/react-ui/src/pages/Models.jsx + locales/en/models.json — new realtime_audio filter button + i18n. - core/http/react-ui/src/utils/capabilities.js — CAP_REALTIME_AUDIO. - core/http/react-ui/src/pages/FineTune.jsx — voice + validation-dataset fields, surfaced when backend === liquid-audio, plumbed via extra_options on submit/export/import. Gallery + importer - gallery/liquid-audio.yaml — config template with known_usecases: [realtime_audio, chat, tts, transcript, vad]. - gallery/index.yaml — four model entries (realtime/chat/asr/tts) keyed by mode option. Fixed pre-existing `transcribe` typo on the asr entry (loader silently dropped the unknown string → entry never surfaced as a transcript model). - gallery/lfm.yaml — function block for the LFM2 Pythonic tool-call format `<\|tool_call_start\|>[name(k="v")]<\|tool_call_end\|>` matching common_chat_params_init_lfm2 in vendored llama.cpp. - core/gallery/importers/{liquid-audio,liquid-audio_test}.go — detector matches LFM2-Audio HF repos (excludes -gguf mirrors); mode/voice preferences plumbed through to options. - core/gallery/importers/importers.go — register LiquidAudioImporter before LlamaCPPImporter. - pkg/functions/parse_lfm2_test.go — seven specs for the response/argument regex pair on the LFM2 pythonic format. Build matrix - .github/backend-matrix.yml — seven liquid-audio targets (cuda12, cuda13, l4t-cuda-13, hipblas, intel, cpu amd64, cpu arm64). Jetpack r36 cuda-12 is skipped (Ubuntu 22.04 / Python 3.10 incompatible with liquid-audio's 3.12 floor). - backend/index.yaml — anchor + 13 image entries. - Makefile — .NOTPARALLEL, prepare-test-extra, test-extra, docker-build-liquid-audio. Docs - .agents/plans/liquid-audio-integration.md — phased plan; PR-D (real any-to-any wiring via AudioToAudioStream), PR-E (mid-audio tool-call detector), PR-G (GGUF entries once upstream llama.cpp PR #18641 lands) remain. - .agents/api-endpoints-and-auth.md — expand the capability-surface checklist with every place a new FLAG_* needs to be registered. Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(realtime): function calling + history cap for any-to-any models Three pieces, all on the realtime_audio path that just landed: 1. liquid-audio backend (backend/python/liquid-audio/backend.py): - _build_chat_state grows a `tools_prelude` arg. - new _render_tools_prelude parses request.Tools (the OpenAI Chat Completions function array realtime.go already serialises) and emits an LFM2 `<\|tool_list_start\|>…<\|tool_list_end\|>` system turn ahead of the user history. Mirrors gallery/lfm.yaml's `function:` template so the model sees the same prompt shape whether served via llama-cpp or here. Without this the backend silently dropped tools — function calling was wired end-to-end on the Go side but the model never saw a tool list. 2. Realtime history cap (core/http/endpoints/openai/realtime.go): - Session grows MaxHistoryItems int; default picked by new defaultMaxHistoryItems(cfg) — 6 for realtime_audio models (LFM2.5 1.5B degrades quickly past a handful of turns), 0/unlimited for legacy pipelines composing larger LLMs. - triggerResponse runs conv.Items through trimRealtimeItems before building conversationHistory. Helper walks the cut left if it would orphan a function_call_output, so tool result + call pairs stay intact. - realtime_gate_test.go: specs for defaultMaxHistoryItems and trimRealtimeItems (zero cap, under cap, over cap, tool-call pair preservation). 3. Talk page (core/http/react-ui/src/pages/Talk.jsx): - Reuses the chat page's MCP plumbing — useMCPClient hook, ClientMCPDropdown component, same auto-connect/disconnect effect pattern. No bespoke tool registry, no new REST endpoints; tools come from whichever MCP servers the user toggles on, exactly as on the chat page. - sendSessionUpdate now passes session.tools=getToolsForLLM(); the update re-fires when the active server set changes mid-session. - New response.function_call_arguments.done handler executes via the hook's executeTool (which round-trips through the MCP client SDK), then replies with conversation.item.create {type:function_call_output} + response.create so the model completes its turn with the tool output. Mirrors chat's client-side agentic loop, translated to the realtime wire shape. UI changes require a LocalAI image rebuild (Dockerfile:308-313 bakes react-ui/dist into the runtime image). Backend.py changes can be swapped live in /backends/<id>/backend.py + /backend/shutdown. Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(realtime): LocalAI Assistant ("Manage Mode") for the Talk page Mirrors the chat-page metadata.localai_assistant flow so users can ask the realtime model what's loaded / installed / configured. Tools are run server-side via the same in-process MCP holder that powers the chat modality — no transport switch, no proxy, no new wire protocol. Wire: - core/http/endpoints/openai/realtime.go: - RealtimeSessionOptions{LocalAIAssistant,IsAdmin}; isCurrentUserAdmin helper mirrors chat.go's requireAssistantAccess (no-op when auth disabled, else requires auth.RoleAdmin). - Session grows AssistantExecutor mcpTools.ToolExecutor. - runRealtimeSession, when opts.LocalAIAssistant is set: gate on admin, fail closed if DisableLocalAIAssistant or the holder has no tools, DiscoverTools and inject into session.Tools, prepend holder.SystemPrompt() to instructions. - Tool-call dispatch loop: when AssistantExecutor.IsTool(name), run ExecuteTool inproc, append a FunctionCallOutput to conv.Items, skip the function_call_arguments client emit (the client can't execute these — it doesn't know about them). After the loop, if any assistant tool ran, trigger another response so the model speaks the result. Mirrors chat's agentic loop, driven server-side rather than via client round-trip. - core/http/endpoints/openai/realtime_webrtc.go: RealtimeCallRequest gains `localai_assistant` (JSON omitempty). Handshake calls isCurrentUserAdmin and builds RealtimeSessionOptions. - core/http/react-ui/src/pages/Talk.jsx: admin-only "Manage Mode" checkbox under the Tools dropdown; passes localai_assistant: true to realtimeApi.call's body, captured in the connect callback's deps. Mirroring chat's pattern means the in-process MCP tools surface "just works" for the Talk page without exposing a Streamable-HTTP MCP endpoint (which was the alternative). Clients with their own MCP servers can still use the existing ClientMCPDropdown path in parallel; the realtime handler distinguishes them by AssistantExecutor.IsTool() at dispatch time. Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(realtime): render Manage Mode tool calls in the Talk transcript Previously the realtime endpoint only emitted response.output_item.added for the FunctionCall item, and Talk.jsx's switch ignored the event — so server-side tool runs were invisible in the UI. The model would speak the result but the user had no way to see what tool was actually called. realtime.go: after executing an assistant tool inproc, emit a second output_item.added/.done pair for the FunctionCallOutput item. Mirrors the way the chat page displays tool_call + tool_result blocks. Talk.jsx: handle both response.output_item.added and .done. Render FunctionCall (with arguments) and FunctionCallOutput (pretty-printed JSON when possible) as two transcript entries — `tool_call` with the wrench icon, `tool_result` with the clipboard icon, both in mono-space secondary-colour. Resets streamingRef after the result so the next assistant text delta starts a fresh transcript entry instead of appending to the previous turn. Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * refactor(realtime): bound the Manage Mode tool-loop + preserve assistant tools Fallout from a review pass on the Manage Mode patches: - Bound the server-side agentic loop. triggerResponse used to recurse on executedAssistantTool with no cap — a model that kept calling tools would blow the goroutine stack. New maxAssistantToolTurns = 10 (mirrors useChat.js's maxToolTurns). Public triggerResponse is now a thin shim over triggerResponseAtTurn(toolTurn int); recursion increments the counter and stops at the cap with an xlog.Warn. - Preserve Manage Mode tools across client session.update. The handler used to blindly overwrite session.Tools, so toggling a client MCP server mid-session silently wiped the in-process admin tools. Session now caches the original AssistantTools slice at session creation and the session.update handler merges them back in (client names win on collision — the client is explicit). - strconv.ParseBool for the localai_assistant query param instead of hand-rolled "1" \|\| "true". Mirrors LocalAIAssistantFromMetadata. - Talk.jsx: render both tool_call and tool_result on response.output_item.done instead of splitting them across .added and .done. The server's event pairing (added → done) stays correct; the UI just doesn't need to inspect both phases of the same item. One switch case instead of two, no behavioural change. Out of scope (noted for follow-ups): extract a shared assistant-tools helper between chat.go and realtime.go (duplication is small enough that two parallel implementations stay readable for now), and an i18n key for the Manage Mode helper text (Talk.jsx doesn't use i18n anywhere else yet). Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * ci(test-extra): wire liquid-audio backend smoke test The backend ships test.py + a `make test` target and is listed in backend-matrix.yml, so scripts/changed-backends.js already writes a `liquid-audio=true\|false` output when files under backend/python/liquid-audio/ change. The workflow just wasn't reading it. - Expose the `liquid-audio` output on the detect-changes job - Add a tests-liquid-audio job that runs `make` + `make test` in backend/python/liquid-audio, gated on the per-backend detect flag The smoke covers Health() and LoadModel(mode:finetune); fine-tune mode short-circuits before any HuggingFace download (backend.py:192), so the job needs neither weights nor a GPU. The full-inference path remains gated on LIQUID_AUDIO_MODEL_ID, which CI doesn't set. The four new Go test files (core/gallery/importers/liquid-audio_test.go, core/http/endpoints/openai/realtime_gate_test.go, core/http/routes/ui_pipeline_models_test.go, pkg/functions/parse_lfm2_test.go) are already picked up by the existing test.yml workflow via `make test` → `ginkgo -r ./pkg/... ./core/...`; their packages all carry RunSpecs entries. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-13 21:57:27 +02:00
LocalAI [bot]	bc3fb16105	feat(ollama): report model capabilities + details on /api/tags and /api/show (#9766 ) Ollama-compatible clients (Open WebUI, Enchanted, ollama-grid-search, etc.) rely on the `capabilities` list and `details.{parameter_size, quantization_level,families}` fields returned by /api/tags and /api/show to decide which models are eligible for a given task -- for example to filter the "embedding model" picker. Upstream Ollama returns these; LocalAI's compat layer was leaving them empty, so embedding models were silently rejected by clients that only allow chat models for chat and only allow embedding models for embeddings. This wires up the existing config signals already present in ModelConfig: - modelCapabilities() derives the Ollama capability strings from the config: "embedding" (FLAG_EMBEDDINGS), "completion" (FLAG_CHAT / FLAG_COMPLETION), "vision" (explicit KnownUsecases bit or MMProj / multimodal template / backend media marker), "tools" (auto-detected ToolFormatMarkers, JSON/Response regex, XML format, grammar triggers), "thinking" (ReasoningConfig with reasoning not disabled) and "insert" (presence of a completion template). - modelDetailsFromModelConfig() now fills families, parameter_size and quantization_level. The latter two are parsed from the GGUF filename via regex -- conservative tokens only (Q/IQ/F16/F32/BF16 and \d+(\.\d+)?[BM] surrounded by separators) so we don't accidentally match "Qwen3" as "3B". - modelInfoFromModelConfig() exposes general.architecture and general.context_length in the new ShowResponse.model_info map. Note: HasUsecases(FLAG_VISION) cannot be used directly -- GuessUsecases has no FLAG_VISION case and returns true at the end for any chat model. hasVisionSupport() instead reads KnownUsecases explicitly plus MMProj / template / media-marker signals. Tests are written first (TDD) using Ginkgo/Gomega -- DescribeTable for the capability mapping (embedding-only, chat, vision, thinking, tools via markers, tools via JSON regex, no-capability rerank) plus integration tests against ShowModelEndpoint that round-trip JSON through a real ModelConfigLoader populated from a temp YAML file. Fixes #9760. Assisted-by: Claude Code:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-12 00:16:19 +02:00
LocalAI [bot]	d892e4af80	feat: add ds4 backend (DeepSeek V4 Flash) with tool calls, thinking, KV cache (#9758 ) * test(e2e-backends): allow BACKEND_BINARY for native-built backends Adds an escape hatch for hardware-gated backends (e.g. ds4) where the model is too large for Docker build context. When BACKEND_BINARY points at a run.sh produced by 'make -C backend/cpp/<name> package', the suite skips docker image extraction and drives the binary directly. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(e2e-backends): validate BACKEND_BINARY basename + log actual source Two follow-ups from the `cbcf5148` code review: - BACKEND_BINARY now requires a path whose basename is `run.sh`. Without this check, `filepath.Dir(binary)` silently discarded the filename, so pointing the env var at an arbitrary binary failed later with a confusing assertion that named a path the user never typed. - The "Testing image=..." debug line printed an empty string when the binary path was used, hiding the actual source in CI logs. The line now reports whichever of BACKEND_IMAGE / BACKEND_BINARY is in effect as `src=...`. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): scaffold ds4 backend dir Adds prepare.sh, run.sh, and a .gitignore. CMakeLists, Makefile, and the implementation arrive in follow-up commits. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): add backend Makefile Drives ds4's upstream Makefile to produce engine .o files (CUDA on Linux when BUILD_TYPE=cublas, Metal on Darwin, otherwise CPU debug path), then invokes CMake on our wrapper. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): add CMakeLists for grpc-server Generates protoc stubs from backend.proto, links grpc-server.cpp + dsml_parser.cpp + dsml_renderer.cpp + kv_cache.cpp against pre-built ds4 engine .o files. DS4_GPU=cuda\|metal\|cpu selects the backend. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): grpc-server skeleton + module stubs The minimum that links: Backend service with Health + Free; other RPCs default to UNIMPLEMENTED. Stub headers/sources for dsml_parser, dsml_renderer, and kv_cache are in place so CMake links cleanly even before those modules ship. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): implement LoadModel Opens engine + creates session sized to ContextSize (default 32768). Backend is compile-time: CPU when DS4_NO_GPU, Metal on __APPLE__, else CUDA. MTP/speculative options are accepted via ModelOptions.Options[] (mtp_path, mtp_draft, mtp_margin). kv_cache_dir option is captured into g_kv_cache_dir for the cache module (Task 19 wires it in). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): implement TokenizeString Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): implement Predict (plain text) Tool calls + thinking-mode split arrive in Task 13 once dsml_parser is in. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): implement PredictStream (plain text) ChatDelta + reasoning/tool_calls split arrives in Task 14. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): implement Status RPC Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): add DSML streaming parser Classifies raw model-emitted token text into CONTENT / REASONING / TOOL_START / TOOL_ARGS / TOOL_END events. Markers it watches for are the literal DSML strings rendered by ds4_server.c's prompt template (<｜DSML｜tool_calls>, <｜DSML｜invoke name=...>, <think>, etc.) - these are plain text the model emits, not special tokens. Partial markers split across token chunks are buffered until a full marker or a definitively-not-a-marker '<' is observed. RandomToolId() generates the API-side tool call id (call_xxx) that exact-replay would key on. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend/cpp/ds4): split hex escapes in DSML markers + add cstring/cstdio includes C++ \x hex escapes have no length cap. '\x9cD' was read as a single escape producing byte 0xCD, eating the 'D'. The markers were never actually matching the DSML text the model emits. Split each escape with adjacent string literal concatenation so the byte sequence is exactly EF BD 9C 44 (｜D) at runtime. Also adds <cstring> and <cstdio> includes (libstdc++ 13 does not transitively expose std::strlen / std::snprintf via <string>). The local plan file (uncommitted) was also updated with the same fixes so Task 16's dsml_renderer.cpp does not re-introduce the bug. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): wire DsmlParser into Predict (ChatDelta) Non-streaming Predict now emits one ChatDelta carrying content, reasoning_content, and tool_calls[] parsed from the model's DSML output. Reply.message still carries the raw model bytes for backends that prefer the regex fallback path. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): wire DsmlParser into PredictStream Per-token ChatDelta writes: content/reasoning_content go incrementally, tool_calls emit TOOL_START as one delta (id + name) followed by TOOL_ARGS deltas with incremental JSON. The Go-side aggregator (pkg/functions/chat_deltas.go) reassembles them. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): chat template + reasoning_effort mapping UseTokenizerTemplate=true + Messages -> ds4_chat_begin / append / assistant_prefix. PredictOptions.Metadata['enable_thinking'] and ['reasoning_effort'] map to ds4_think_mode (DS4_THINK_HIGH default; 'max'/'xhigh' -> DS4_THINK_MAX; disabled -> DS4_THINK_NONE). Tool-call rendering for assistant turns with tool_calls JSON arrives in the next commit (dsml_renderer). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): render assistant tool_calls + tool results to DSML Closes the round-trip: when an OpenAI client sends a multi-turn chat where prior turns contain tool_calls or role=tool messages, build_prompt serializes them back to the DSML shape the model was trained on. Mirrors ds4_server.c's prompt renderer; uses nlohmann::json for parsing the OpenAI tool_calls payload. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): disk KV cache module Dir-based cache keyed by SHA1(rendered prompt prefix). File format: 'DS4G' magic + version + ctx_size + prefix_len + prefix + payload_bytes + ds4_session_save_payload output. NOT bit-compatible with ds4-server's KVC files - that interop is a follow-up plan. LoadLongestPrefix walks the dir picking the longest stored prefix that prefixes the incoming prompt. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): wire KvCache into Predict/PredictStream LoadModel reads 'kv_cache_dir' from ModelOptions.Options[], passes it to g_kv_cache.SetDir. Each Predict/PredictStream computes a render text for the request, tries LoadLongestPrefix to recover state, then Saves the new state after generation. ds4_session_sync handles the live-cache fast path internally, so the disk cache only matters for cold-starts and cross-session reuse. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): add package.sh Linux: bundles libc + ld + libstdc++ + libgomp + GPU runtime libs into package/lib so the FROM scratch image boots without a host libc. Darwin is handled by scripts/build/ds4-darwin.sh which uses otool -L. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend/cpp/ds4): rename namespace ds4_backend -> ds4cpp ds4.h defines 'typedef enum {...} ds4_backend' which collides with our C++ 'namespace ds4_backend' anywhere a TU includes both. kv_cache.h includes ds4.h directly and surfaces the conflict immediately; other TUs would hit it once gRPC dev headers are available. Renames the C++ namespace to ds4cpp across all wrapper files and the plan, leaving the upstream ds4 typedef untouched. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend): add Dockerfile.ds4 Single-stage builder (CUDA devel image for cublas, ubuntu:24.04 for cpu) -> FROM scratch with packaged grpc-server + bundled runtime libs. nlohmann-json3-dev is required for dsml_renderer's JSON handling. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(make): wire backend/cpp/ds4 + ds4-darwin into root Makefile BACKEND_DS4 entry + generate-docker-build-target eval + docker-build-ds4 in docker-build-backends + .NOTPARALLEL guards. Also adds the backends/ds4-darwin target which delegates to scripts/build/ds4-darwin.sh (landed in Task 24). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci: add backend-matrix entries for ds4 (cpu + cuda13, per-arch) Two entries per build (amd64 + arm64) so backend-merge-jobs assembles a multi-arch manifest. Skipping cuda12 - ds4 was validated against CUDA 13. Darwin Metal is handled outside this matrix by backend_build_darwin.yml. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/index): add ds4 meta + image entries cpu + cuda13 x latest + master. Darwin Metal builds publish under ds4-darwin via the existing llama-cpp-darwin OCI pipeline. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(scripts/build): add ds4-darwin.sh Native macOS/Metal build for the ds4 backend. Mirrors llama-cpp-darwin.sh: make grpc-server -> otool -L for dylib bundling -> OCI tar that 'local-ai backends install' consumes via the backends/ds4-darwin Makefile target. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(darwin): build ds4-darwin in backend_build_darwin Adds a 'Build ds4 backend (Darwin Metal)' step that runs the backends/ds4-darwin Makefile target on the macOS runner. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(import): auto-detect ds4 weights via DS4Importer Adds core/gallery/importers/ds4.go which matches on the antirez/deepseek-v4-gguf repo URI and the DeepSeek-V4-Flash-.gguf filename pattern. Registered before LlamaCPPImporter so ds4 weights route to backend: ds4 instead of falling through to llama-cpp. Also lists ds4 in /backends/known so the /import-model UI surfaces it as a manual choice for users who want to force the backend on a non-canonical URI. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> feat(gallery): add deepseek-v4-flash-q2 (ds4 backend) One-click install of the q2 weights with backend: ds4. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(.agents): add ds4-backend.md Documents the backend shape, DSML state machine, thinking-mode mapping, disk KV cache, build matrix (cpu/cuda13/Darwin), and the BACKEND_BINARY hardware-validation path. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend/cpp/ds4): pass UBUNTU_VERSION + arch env vars to install-base-deps The .docker/install-base-deps.sh script needs UBUNTU_VERSION (defaults to 2404), TARGETARCH, SKIP_DRIVERS, and APT_MIRROR/APT_PORTS_MIRROR exported into the environment so it can pick the right cuda-keyring / cudss / nvpl debs and apt mirrors. Dockerfile.ds4 was declaring some of the ARGs but not re-exporting them via ENV. Mirrors Dockerfile.llama-cpp's pattern. Without this fix 'make docker-build-ds4 BUILD_TYPE=cublas CUDA_MAJOR_VERSION=13' failed at: /usr/local/sbin/install-base-deps: line 120: UBUNTU_VERSION: unbound variable Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/index): add Metal image entries for ds4 Adds metal-ds4 + metal-ds4-development image entries pointing at quay.io/go-skynet/local-ai-backends:{latest,master}-metal-darwin-arm64-ds4 (built by scripts/build/ds4-darwin.sh on macOS arm64 runners), plus the 'metal' and 'metal-darwin-arm64' capability mappings on the ds4 meta and ds4-development variant. Closes a gap from the initial Task 23 landing - the Darwin Metal build script and CI workflow step were already wired (Tasks 24-25), but the gallery had no image entry for users to install the Metal variant. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(ci): use ubuntu:24.04 base for ds4 cuda13 matrix entries The initial Task 22 matrix landing used base-image: 'nvidia/cuda:13.0.0-devel-ubuntu24.04' which clashes with install-base-deps.sh's cuda-keyring step: E: Conflicting values set for option Signed-By regarding source https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/sbsa/ The canonical pattern (llama-cpp, ik-llama-cpp, turboquant) uses plain 'ubuntu:24.04' + 'skip-drivers: false' so install-base-deps installs CUDA from scratch via its own keyring setup. Adopting that here. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend/cpp/ds4): drop install-base-deps.sh dependency The .docker/install-base-deps.sh pipeline is built around the llama-cpp needs: NVIDIA keyring + cuda-toolkit apt + gRPC-from-source build at /opt/grpc. For ds4 we don't need any of that: - CUDA: nvidia/cuda:13.0.0-devel-ubuntu24.04 ships /usr/local/cuda ready to go; install-base-deps's keyring step then conflicts with the pre-installed Signed-By. - gRPC: ds4's grpc-server.cpp only links against grpc++; system libgrpc++-dev (apt) is sufficient, no source build needed. Replaced the install-base-deps invocation in Dockerfile.ds4 with a direct 'apt-get install libgrpc++-dev libprotobuf-dev protobuf-compiler-grpc nlohmann-json3-dev cmake build-essential pkg-config git'. Matrix entries back to nvidia/cuda base + skip-drivers=true so install-base-deps would no-op even if some downstream tooling calls it. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend/cpp/ds4): correct proto accessors + alias grpc::Status as GStatus Two compile bugs caught by the docker build: 1. proto::Message uses snake_case accessors. The build_prompt loop called m.toolcalls() / m.toolcallid() - the protoc-generated names are m.tool_calls() / m.tool_call_id(). Plan-text bug propagated to the wrapper. 2. The Status RPC method shadowed the 'using grpc::Status' alias, so any later method declaration using Status as a return type failed to parse ('Status does not name a type' starting at LoadModel). Solution: alias grpc::Status as GStatus instead, with no 'using' clause that would conflict. All RPC method declarations and return-statement constructions now use GStatus. Pre-existing code reviewer flagged the Status-shadow concern as 'minor' in the original Task 10 commit; it turned out to be a real compile blocker under libstdc++ 13 once the surrounding methods were filled in. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend/cpp/ds4): preserve TOOL_ARGS content in dsml_parser Flush When the model emitted a parameter value that arrived in the same buffer as the surrounding tool_call markers (e.g. the buffered tail after a literal '</think>' opened the model output), the parser deferred all buffered bytes to Flush() because looks_like_prefix() always returns true while buf starts with '<'. Flush() then drained the buffer as plain CONTENT/REASONING regardless of parser state, so the bytes between the parameter open and close markers were classified as CONTENT instead of TOOL_ARGS. Symptom: the model emitted <\|DSML\|parameter name="location" string="true">Paris, France</\|DSML\|parameter> and the assembled tool_call arguments came out as {"location":""} - the opener and closer were emitted into the args stream but the "Paris, France" content went to the assistant message instead. Fix: 1. Flush() now uses the same state-aware emit logic as DrainPlain: PARAM_VALUE bytes become TOOL_ARGS (json-escaped when string), THINK bytes become REASONING, TEXT bytes become CONTENT, and INVOKE / TOOL_CALLS structural whitespace is discarded. 2. looks_like_prefix() restricts its leading-'<' fallback to buffers that have not yet seen a '>'. Without that change, char-by-char feeds would discard the '<' of '<\|DSML\|invoke name="..."' once the marker prefix length was reached but the closing quote/'>' were still in flight. Verified with a standalone harness that runs the failing input three ways (single Feed, split-after-'>', and char-by-char) and aggregates TOOL_ARGS for tool index 0: all three now produce {"location":"Paris, France"}. Assisted-by: Claude:opus-4.7 [Read,Edit,Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(backend/cpp/ds4): use ds4_session_sync + manual generation loop for KV persistence ds4_engine_generate_argmax() is a self-contained helper that doesn't take or update a ds4_session - it manages its own internal state. Our Predict and PredictStream methods created g_session via ds4_session_create() but then called ds4_engine_generate_argmax(), so g_session's KV state never advanced. ds4_session_payload_bytes(g_session) returned 0 and the disk KV cache save correctly rejected with 'session has no valid checkpoint to save'. Switch both RPCs to the proper session API: ds4_session_sync(g_session, &prompt, ...) loop: int token = ds4_session_argmax(g_session) if token == eos: break emit(token) ds4_session_eval(g_session, token, ...) After the loop the session has a real checkpoint and ds4_session_save_payload writes the KV state to disk. Verified end-to-end on a DGX Spark GB10: three .kv files (15-30 MB each) are written when BACKEND_TEST_OPTIONS sets kv_cache_dir, and the e2e tool-call assertion still passes. Also added stderr diagnostics to KvCache (enabled/disabled at SetDir; per-save path + payload_bytes + result) so future failures are visible instead of silent. The 'wrote ok' lines are low-volume - one per Predict/PredictStream when the cache is enabled - and skipped entirely when the option is unset. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): use ds4_session_eval_speculative_argmax when MTP loaded Wires MTP (Multi-Token Prediction) speculative decoding into the manual generation loop in both Predict and PredictStream. When the upstream MTP weights are loaded via 'mtp_path:' option AND we're on CUDA / Metal, ds4_engine_mtp_draft_tokens() returns >0 and we switch the inner loop to ds4_session_eval_speculative_argmax(), which can accept N>1 tokens per verifier step. When MTP is not loaded (no option, CPU backend, or weights absent), we fall through to the simple ds4_session_argmax + ds4_session_eval path with no behavior change. Validated on a DGX Spark GB10 with the optional MTP GGUF (DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf, ~3.6 GB). LoadModel logs 'ds4: MTP support model loaded ... (draft=2)' on stderr. Caveat per upstream README: 'currently provides at most a slight speedup, not a meaningful generation-speed win'. Wired now mainly to track the upstream API; bigger speedups arrive when ds4 improves the speculative path. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(backend/cpp/ds4): honor PredictOptions sampling with DSML-aware override Mirrors ds4_server.c:7102-7115 sampling-policy semantics on the LocalAI gRPC side. The generation loop now consults compute_sample_params() per token to pick the effective (temperature, top_k, top_p, min_p), based on: 1. Request defaults: PredictOptions.temperature / .topk / .topp / .minp 2. Thinking-mode override: when enable_thinking != false, force T=1.0, top_k=0, top_p=1.0, min_p=0.0 (creativity for the reasoning pass and the trailing content) 3. DSML structural override: when DsmlParser::IsInDsmlStructural() returns true (we are between tool-call markers but NOT in a param value payload), force T=0.0 so protocol bytes parse cleanly When the effective temperature is 0, we keep using ds4_session_argmax + MTP speculative path (matches ds4-server's gate that only enables MTP for greedy positions). When > 0, we call ds4_session_sample(s, T, ...) with a per-thread RNG seeded from system_clock and fall back to single-token ds4_session_eval. New public method on DsmlParser: IsInDsmlStructural() encodes which states need protocol-byte determinism. PARAM_VALUE is excluded (payload uses user sampling); TEXT and THINK are excluded (no tool-call context to protect). Verified on the DGX Spark GB10: the e2e suite still passes with all 5 specs including tools, and the Predict output now varies between runs (creative sampling active) while the tool-call args remain a clean '{"location":"Paris, France"}' because the parser-state check forces greedy on the structural bytes. UX note: thinking mode is ON by default (matching ds4-server). Users who want deterministic output should set Metadata.enable_thinking = false. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(gallery): add sha256 to deepseek-v4-flash-q2 entry Per HF LFS metadata for antirez/deepseek-v4-gguf: size: 86720111200 bytes (~80.76 GiB) sha256: 31598c67c8b8744d3bcebcd19aa62253c6dc43cef3b8adf9f593656c9e86fd8c LocalAI's downloader verifies sha256 when present, so users who install deepseek-v4-flash-q2 from the gallery get integrity-checked weights and the partial-download issue (an 81 GB file is easy to truncate) becomes recoverable instead of silently producing a broken backend. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-11 22:15:47 +02:00
Richard Palethorpe	670259ce43	chore: Security hardening (#9719 ) * fix(http): close 0.0.0.0/[::] SSRF bypass in /api/cors-proxy The CORS proxy carried its own private-network blocklist (RFC 1918 + a handful of IPv6 ranges) instead of using the same classification as pkg/utils/urlfetch.go. The hand-rolled list missed 0.0.0.0/8 and ::/128, both of which Linux routes to localhost — so any user with FeatureMCP (default-on for new users) could reach LocalAI's own listener and any other service bound to 0.0.0.0:port via: GET /api/cors-proxy?url=http://0.0.0.0:8080/... GET /api/cors-proxy?url=http://[::]:8080/... Replace the custom check with utils.IsPublicIP (Go stdlib IsLoopback / IsLinkLocalUnicast / IsPrivate / IsUnspecified, plus IPv4-mapped IPv6 unmasking) and add an upfront hostname rejection for localhost, .local, and the cloud metadata aliases so split-horizon DNS can't paper over the IP check. The IP-pinning DialContext is unchanged: the validated IP from the single resolution is reused for the connection, so DNS rebinding still cannot swap a public answer for a private one between validate and dial. Regression tests cover 0.0.0.0, 0.0.0.0:PORT, [::], ::ffff:127.0.0.1, ::ffff:10.0.0.1, file://, gopher://, ftp://, localhost, 127.0.0.1, 10.0.0.1, 169.254.169.254, metadata.google.internal. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> fix(downloader): verify SHA before promoting temp file to final path DownloadFileWithContext renamed the .partial file to its final name before checking the streamed SHA, so a hash mismatch returned an error but left the tampered file at filePath. Subsequent code that operated on filePath (a backend launcher, a YAML loader, a re-download that finds the file already present and skips) would consume the attacker-supplied bytes. Reorder: verify the streamed hash first, remove the .partial on mismatch, then rename. The streamed hash is computed during io.Copy so no second read is needed. While here, raise the empty-SHA case from a Debug log to a Warn so "this download had no integrity check" is visible at the default log level. Backend installs currently pass through with no digest; the warning makes that footprint observable without changing behaviour. Regression test asserts os.IsNotExist on the destination after a deliberate SHA mismatch. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(auth): require email_verified for OIDC admin promotion extractOIDCUserInfo read the ID token's "email" claim but never inspected "email_verified". With LOCALAI_ADMIN_EMAIL set, an attacker who could register on the configured OIDC IdP under that email (some IdPs accept self-supplied unverified emails) inherited admin role: - first login: AssignRole(tx, email, adminEmail) → RoleAdmin - re-login: MaybePromote(db, user, adminEmail) → flip to RoleAdmin Add EmailVerified to oauthUserInfo, parse email_verified from the OIDC claims (default false on absence so an IdP that omits the claim cannot short-circuit the gate), and substitute "" for the role-decision email when verified=false via emailForRoleDecision. The user record still stores the unverified email for display. GitHub's path defaults EmailVerified=true: GitHub only returns a public profile email after verification, and fetchGitHubPrimaryEmail explicitly filters to Verified=true. Regression tests cover both the helper contract and integration with AssignRole, including the bootstrap "first user" branch that would otherwise mask the gate. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(cli): refuse public bind when no auth backend is configured When neither an auth DB nor a static API key is set, the auth middleware passes every request through. That is fine for a developer laptop, a home LAN, or a Tailnet — the network itself is the trust boundary. It is not fine on a public IP, where every model install, settings change, and admin endpoint becomes reachable from the internet. Refuse to start in that exact configuration. Loopback, RFC 1918, RFC 4193 ULA, link-local, and RFC 6598 CGNAT (Tailscale's default range) all count as trusted; wildcard binds (`:port`, `0.0.0.0`, `[::]`) are accepted only when every host interface is in one of those ranges. Hostnames are resolved and treated as trusted only when every answer is. A new --allow-insecure-public-bind / LOCALAI_ALLOW_INSECURE_PUBLIC_BIND flag opts out for deployments that gate access externally (a reverse proxy enforcing auth, a mesh ACL, etc.). The error message lists this plus the three constructive alternatives (bind a private interface, enable --auth, set --api-keys). The interface enumeration goes through a package-level interfaceAddrsFn var so tests can simulate cloud-VM, home-LAN, Tailscale-only, and enumeration-failure topologies without poking at the real network stack. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * test(http): regression-test the localai_assistant admin gate ChatEndpoint already rejects metadata.localai_assistant=true from a non-admin caller, but the gate was open-coded inline with no direct test coverage. The chat route is FeatureChat-gated (default-on), and the assistant's in-process MCP server can install/delete models and edit configs — the wrong handler change would silently turn the LLM into a confused deputy. Extract the gate into requireAssistantAccess(c, authEnabled) and pin its behaviour: auth disabled is a no-op, unauthenticated is 403, RoleUser is 403, RoleAdmin and the synthetic legacy-key admin are admitted. No behaviour change in the production path. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * test(http): assert every API route is auth-classified The auth middleware classifies path prefixes (/api/, /v1/, /models/, etc.) as protected and treats anything else as a static-asset passthrough. A new endpoint shipped under a brand-new prefix — or a new path that simply isn't on the prefix allowlist — would be reachable anonymously. Walk every route registered by API() with auth enabled and a fresh in-memory database (no users, no keys), and assert each API-prefixed route returns 401 / 404 / 405 to an anonymous request. Public surfaces (/api/auth/, /api/branding, /api/node/ token-authenticated routes, /healthz, branding asset server, generated-content server, static assets) are explicit allowlist entries with comments justifying them. Build-tagged 'auth' so it runs against the SQLite-backed auth DB (matches the existing auth suite). Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * test(http): pin agent endpoint per-user isolation contract agents.go's getUserID / effectiveUserID / canImpersonateUser / wantsAllUsers helpers are the single trust boundary for cross-user access on agent, agent-jobs, collections, and skills routes. A regression there is the difference between "regular user reads their own data" and "regular user reads anyone's data via ?user_id=victim". Lock in the contract: - effectiveUserID ignores ?user_id= for unauthenticated and RoleUser - effectiveUserID honours it for RoleAdmin and ProviderAgentWorker - wantsAllUsers requires admin AND the literal "true" string - canImpersonateUser is admin OR agent-worker, never plain RoleUser No production change — this commit only adds tests. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(downloader): drop redundant stat in removePartialFile The stat-then-remove pattern is a TOCTOU window and a wasted syscall — os.Remove already returns ErrNotExist for the missing-file case, so trust that and treat it as a no-op. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(http): redact secrets from trace buffer and distribution-token logs The /api/traces buffer captured Authorization, Cookie, Set-Cookie, and API-key headers verbatim from every request when tracing was enabled. The endpoint is admin-only but the buffer is reachable via any heap-style introspection and the captured tokens otherwise outlive the request. Strip those header values at capture time. Body redaction is left to a follow-up — the prompts are usually the operator's own and JSON-walking is invasive. Distribution tokens were also logged in plaintext from core/explorer/discovery.go; logs forward to syslog/journald and outlive the token. Redact those to a short prefix/suffix instead. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(auth): rate-limit OAuth callbacks separately from password endpoints The shared 5/min/IP limit on auth endpoints is right for password-style flows but too tight for OAuth callbacks: corporate SSO funnels many real users through one outbound IP and would trip the limit. Add a separate 60/min/IP limiter for /api/auth/{github,oidc}/callback so callbacks are bounded against floods without breaking shared-IP deployments. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(gallery): verify backend tarball sha256 when set in gallery entry GalleryBackend gained an optional sha256 field; the install path now threads it through to the existing downloader hash-verify (which already streams, verifies, and rolls back on mismatch). Galleries without sha256 keep working; the empty-SHA path still emits the existing "downloading without integrity check" warning. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * test(http): pin CSRF coverage on multipart endpoints The CSRF middleware in app.go is global (e.Use) so it covers every multipart upload route — branding assets, fine-tune datasets, audio transforms, agent collections. Pin that contract: cross-site multipart POSTs are rejected; same-origin / same-site / API-key clients are not. Also pins the SameSite=Lax fallback path the skipper relies on when Sec-Fetch-Site is absent. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(http): XSS hardening — CSP headers, safe href, base-href escape, SVG sandbox Several closely related XSS-prevention changes spanning the SPA shell, the React UI, and the branding asset server: - New SecurityHeaders middleware sets CSP, X-Content-Type-Options, X-Frame-Options, and Referrer-Policy on every response. The CSP keeps script-src permissive because the Vite bundle relies on inline + eval'd scripts; tightening that requires moving to a nonce-based policy. - The <base href> injection in the SPA shell escaped attacker-controllable Host / X-Forwarded-Host headers — a single quote in the host header broke out of the attribute. Pass through SecureBaseHref (html.EscapeString). - Three React sinks rendering untrusted content via dangerouslySetInnerHTML switch to text-node rendering with whiteSpace: pre-wrap: user message bodies in Chat.jsx and AgentChat.jsx, and the agent activity log in AgentChat.jsx. The hand-rolled escape on the agent user-message variant is replaced by the same plain-text path. - New safeHref util collapses non-allowlisted URI schemes (most importantly javascript:) to '#'. Applied to gallery `<a href={url}>` links in Models / Backends / Manage and to canvas artifact links — these come from gallery JSON or assistant tool calls and must be treated as untrusted. - The branding asset server attaches a sandbox CSP plus same-origin CORP to .svg responses. The React UI loads logos via <img>, but the same URL is also reachable via direct navigation; this prevents script execution if a hostile SVG slipped past upload validation. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(http): bound HTTP server with read-header and idle timeouts A net/http server with no timeouts is trivially Slowloris-able and leaks idle keep-alive connections. Set ReadHeaderTimeout (30s) to plug the slow-headers attack and IdleTimeout (120s) to cap keep-alive sockets. ReadTimeout and WriteTimeout stay at 0 because request bodies can be multi-GB model uploads and SSE / chat completions stream for many minutes; operators who need tighter per-request bounds should terminate slow clients at a reverse proxy. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * test(auth): pin PUT /api/auth/profile field-tampering contract The handler uses an explicit local body struct (only name and avatar_url) plus a gorm Updates(map) with a column allowlist, so an attacker posting {"role":"admin","email":"...","password_hash":"..."} can't mass-assign those fields. Lock that down with a regression test so a future "let's just c.Bind(&user)" refactor breaks loudly. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(services): strip directory components from multipart upload filenames UploadDataset and UploadToCollectionForUser took the raw multipart file.Filename and joined it into a destination path. The fine-tune upload was incidentally safe because of a UUID prefix that fused any leading '..' to a literal segment, but the protection is fragile. UploadToCollectionForUser handed the filename to a vendored backend without sanitising at all. Strip to filepath.Base at both boundaries and reject the trivial unsafe values ("", ".", "..", "/"). Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(react-ui): validate persisted MCP server entries on load localStorage is shared across same-origin pages; an XSS that lands once can poison persisted MCP server config to attempt header injection or to feed a non-http URL into the fetch path on subsequent loads. Validate every entry: types must match, URL must parse with http(s) scheme, header keys/values must be control-char-free. Drop anything that doesn't fit. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(http): close X-Forwarded-Prefix open redirect The reverse-proxy support concatenated X-Forwarded-Prefix into the redirect target without validation, so a forged header value of "//evil.com" turned the SPA-shell redirect helper at /, /browse, and /browse/* into a 301 to //evil.com/app. The path-strip middleware had the same shape on its prefix-trailing-slash redirect. Add SafeForwardedPrefix at the middleware boundary: must start with a single '/', no protocol-relative '//' opener, no scheme, no backslash, no control characters. Apply at both consumers; misconfig trips the validator and the header is dropped. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(http): refuse wildcard CORS when LOCALAI_CORS=true with empty allowlist When LOCALAI_CORS=true but LOCALAI_CORS_ALLOW_ORIGINS was empty, Echo's CORSWithConfig saw an empty allow-list and fell back to its default AllowOrigins=[""]. An operator who flipped the strict-CORS feature flag without populating the list got the opposite of what they asked for. Echo never sets Allow-Credentials: true so this isn't directly exploitable (cookies aren't sent under wildcard CORS), but the misconfiguration trap is worth closing. Skip the registration and warn. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> feat(auth): zxcvbn password strength check with user-acknowledged override The previous policy was len < 8, which let through "Password1" and the rest of the credential-stuffing corpus. LocalAI has no second factor yet, so the bar needs to sit higher. Add ValidatePasswordStrength using github.com/timbutler/zxcvbn (an actively-maintained fork of the trustelem port; v1.0.4, April 2024): - min 12 chars, max 72 (bcrypt's truncation point) - reject NUL bytes (some bcrypt callers truncate at the first NUL) - require zxcvbn score >= 3 ("safely unguessable, ~10^8 guesses to break"); the hint list ["localai", "local-ai", "admin"] penalises passwords built from the app's own branding zxcvbn produces false positives sometimes (a strong-looking password that happens to match a dictionary word) and operators occasionally need to set a known-weak password (kiosk demos, CI rigs). Add an acknowledgement path: PasswordPolicy{AllowWeak: true} skips the entropy check while still enforcing the hard rules. The structured PasswordErrorResponse marks weak-password rejections as Overridable so the UI can surface a "use this anyway" checkbox. Wired through register, self-service password change, and admin password reset on both the server and the React UI. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(react-ui): drop HTML5 minLength on new-password inputs minLength={12} on the new-password input let the browser block the form submit silently before any JS or network call ran. The browser focused the field, showed a brief native tooltip, and that was that — no toast, no fetch, no clue. Reproducible by typing fewer than 12 chars on the second password change of a session. The JS-level length check in handleSubmit already shows a toast and the server rejects with a structured error, so the HTML5 attribute was redundant defence anyway. Drop it. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(react-ui): bundle Geist fonts locally instead of fetching from Google The new CSP correctly refused to apply styles from fonts.googleapis.com because style-src is locked to 'self' and 'unsafe-inline'. Loosening the CSP would defeat its purpose; the right fix is to stop reaching out to a third-party CDN for fonts on every page load. Add @fontsource-variable/geist and @fontsource-variable/geist-mono as npm deps and import them once at boot. Drop the <link rel="preconnect"> and external stylesheet from index.html. Side benefit: no third-party tracking via Referer / IP on every UI load, no failure mode when offline / behind a captive portal. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * fix(react-ui): refresh i18n strings to reflect 12-char password minimum The translations still said "at least 8 characters" everywhere — the client-side toast on a too-short password change told the user the wrong floor. Update tooShort and newPasswordPlaceholder / newPasswordDescription across all five locales (en, es, it, de, zh-CN) to match the real ValidatePasswordStrength rule. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(auth): make password length-floor overridable like the entropy check The 12-char minimum was a policy choice, not a technical invariant — only "non-empty", "<= 72 bytes", and "no NUL bytes" are real bcrypt constraints. Treating length-12 as a hard rule was inconsistent with the entropy check (already overridable) and friction for use cases where the account is just a name on a session, not a security boundary (single-user kiosk, CI rig, lab demo). Restructure ValidatePasswordStrength: - Hard rules (always enforced): non-empty, <= MaxPasswordLength, no NUL byte - Policy rules (skipped when AllowWeak=true): length >= 12, zxcvbn score >= 3 PasswordError now marks password_too_short as Overridable too. The React forms generalised from `error_code === 'password_too_weak'` to `overridable === true`, and the JS-side preflight length checks were removed (server is source of truth, returns the same checkbox flow). Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-08 16:25:45 +02:00
LocalAI [bot]	e5d7b84216	fix(distributed): split NATS backend.upgrade off install + dedup loads (#9717 ) * feat(messaging): add backend.upgrade NATS subject + payload types Splits the slow force-reinstall path off backend.install so it can run on its own subscription goroutine, eliminating head-of-line blocking between routine model loads and full gallery upgrades. Wire-level Force flag on BackendInstallRequest is kept for one release as the rolling-update fallback target; doc note marks it deprecated. Assisted-by: Claude:claude-sonnet-4-6 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed/worker): add per-backend mutex helper to backendSupervisor Different backend names lock independently; same backend serializes. This is the synchronization primitive used by the upcoming concurrent install handler — without it, wrapping the NATS callback in a goroutine would race the gallery directory when two requests target the same backend. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(distributed/worker): run backend.install handler in a goroutine NATS subscriptions deliver messages serially on a single per-subscription goroutine. With a synchronous install handler, a multi-minute gallery download would head-of-line-block every other install request to the same worker — manifesting upstream as a 5-minute "nats: timeout" on unrelated routine model loads. The body now runs in its own goroutine, with a per-backend mutex (lockBackend) protecting the gallery directory from concurrent operations on the same backend. Different backend names install in parallel. Backward-compat: req.Force=true is still honored here, so an older master that hasn't been updated to send on backend.upgrade keeps working. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed/worker): subscribe to backend.upgrade as a separate path Slow force-reinstall now lives on its own NATS subscription, so a multi-minute gallery pull cannot head-of-line-block the routine backend.install handler on the same worker. Same per-backend mutex guards both — concurrent install + upgrade for the same backend serialize at the gallery directory; different backends are independent. upgradeBackend stops every live process for the backend, force-installs from gallery, and re-registers. It does not start a new process — the next backend.install will spawn one with the freshly-pulled binary. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): add UpgradeBackend on NodeCommandSender; drop Force from InstallBackend Master now sends to backend.upgrade for force-reinstall, with a nats.ErrNoResponders fallback to the legacy backend.install Force=true path so a rolling update with a new master + an old worker still converges. The Force parameter leaves the public Go API surface entirely — only the internal fallback sets it on the wire. InstallBackend timeout drops 5min -> 3min (most replies are sub-second since the worker short-circuits on already-running or already-installed). UpgradeBackend timeout is 15min, sized for real-world Jetson-on-WiFi gallery pulls. Updates the admin install HTTP endpoint (core/http/endpoints/localai/nodes.go) to the new signature too. router_test.go's fakeUnloader does not yet implement the new interface shape; Task 3.2 will catch it up before the next package-level test run. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(distributed): update fakeUnloader for new NodeCommandSender shape InstallBackend lost its force bool param (Force is not part of the public Go API anymore — only the internal upgrade-fallback path sets it on the wire). UpgradeBackend gained a method. Fake records both call slices and provides an installHook concurrency seam for upcoming singleflight tests. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(distributed): cover UpgradeBackend's new subject + rolling-update fallback Task 3.1 changed the master to publish UpgradeBackend on the new backend.upgrade subject; the existing UpgradeBackend tests scripted the old install subject and so all 3 began failing as expected. Updates them to script SubjectNodeBackendUpgrade with BackendUpgradeReply. Adds two new specs for the rolling-update fallback: - ErrNoResponders on backend.upgrade triggers a backend.install Force=true retry on the same node. - Non-NoResponders errors propagate to the caller unchanged. scriptedMessagingClient gains scriptNoResponders (real nats sentinel) and scriptReplyMatching (predicate-matched canned reply, used to assert that the fallback path actually sets Force=true on the install retry). Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(distributed): coalesce concurrent identical backend.install via singleflight Six simultaneous chat completions for the same not-yet-loaded model were observed firing six independent NATS install requests, each serializing through the worker's per-subscription goroutine and amplifying queue depth. SmartRouter now wraps the NATS round-trip in a singleflight.Group keyed by (nodeID, backend, modelID, replica): N concurrent identical loads share one round-trip and one reply. Distinct (modelID, replica) keys still fire independent calls, so multi-replica scaling and multi-model fan-out are unaffected. fakeUnloader gains a sync.Mutex around its recording slices to keep concurrent test goroutines race-clean. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(e2e/distributed): drop force arg from InstallBackend test calls Two e2e test call sites still passed the trailing force bool that was removed from RemoteUnloaderAdapter.InstallBackend in `9bde76d7`. Caught by golangci-lint typecheck on the upgrade-split branch (master CI was already green because these tests don't run in the standard test path). Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(distributed): extract worker business logic to core/services/worker core/cli/worker.go grew to 1212 lines after the backend.upgrade split. The CLI package was carrying backendSupervisor, NATS lifecycle handlers, gallery install/upgrade orchestration, S3 file staging, and registration helpers — all distributed-worker business logic that doesn't belong in the cobra surface. Move it to a new core/services/worker package, mirroring the existing core/services/{nodes,messaging,galleryop} pattern. core/cli/worker.go shrinks to ~19 lines: a kong-tagged shim that embeds worker.Config and delegates Run. No behavior change. All symbols stay unexported except Config and Run. The three worker-specific tests (addr/replica/concurrency) move with the code via git mv so history follows them. Files split as: worker.go - Run entry point config.go - Config struct (kong tags retained, kong not imported) supervisor.go - backendProcess, backendSupervisor, process lifecycle install.go - installBackend, upgradeBackend, findBackend, lockBackend lifecycle.go - subscribeLifecycleEvents (verbatim, decomposition is a follow-up commit) file_staging.go - subscribeFileStaging, isPathAllowed registration.go - advertiseAddr, registrationBody, heartbeatBody, etc. reply.go - replyJSON process_helpers.go - readLastLinesFromFile Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(distributed/worker): decompose subscribeLifecycleEvents into per-event handlers The 226-line subscribeLifecycleEvents method packed eight NATS subscriptions inline. Each grew context-shaped doc comments mixed with subscription plumbing, making it hard to read any one handler without scrolling past the others. Extract each handler into its own method on *backendSupervisor; the subscriber becomes a thin 8-line dispatcher. No behavior change: each method body is byte-equivalent to its corresponding inline goroutine + handler. Doc comments that were attached to the inline SubscribeReply calls migrate to the new method godocs. Adding the next NATS subject is now a 2-line patch to the dispatcher plus one new method, instead of grafting onto a monolith. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-08 16:24:54 +02:00
LocalAI [bot]	2be07f61da	feat(whisper): honor client cancellation via ggml abort_callback (#9710 ) * refactor(transcription): propagate request ctx through ModelTranscription* Replaces context.Background() with the HTTP request ctx so client disconnects start cancelling the gRPC call. No backend-side abort wiring yet — that comes in a later commit. Pure plumbing. Assisted-by: Claude:claude-haiku-4-5 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(cli): pass ctx to backend.ModelTranscription Follow-up to `e65d3e1f` which threaded ctx through ModelTranscription but missed the CLI caller. CLI commands have no request-scoped ctx, so context.Background() is correct here. Assisted-by: Claude:claude-haiku-4-5 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(audio): propagate request ctx into TTS, sound-gen, audio-transform Same ctx-plumbing pattern applied to the rest of the audio path. CLI callers use context.Background() since there is no request scope; HTTP callers use c.Request().Context(). Assisted-by: Claude:claude-haiku-4-5 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(backend): propagate request ctx into biometric, detection, rerank, diarization paths Replaces remaining context.Background() sites in core/backend with the caller's ctx. After this commit, every core/backend/.go entry point threads the request ctx end-to-end to the gRPC client. Assisted-by: Claude:claude-haiku-4-5 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> refactor(grpc): plumb ctx through AIModel.AudioTranscription{,Stream} Adds context.Context as first parameter to the AIModel interface methods that wrap whisper-style transcription. Server-side gRPC handler now forwards the per-RPC ctx (server-streaming uses stream.Context()). Whisper, Voxtral, vibevoice-cpp, and sherpa-onnx accept the parameter; none uses it yet — the actual cancellation primitive lands in the next commit so this is pure plumbing. Assisted-by: Claude:claude-sonnet-4-6 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(whisper): add abort_callback hook in the C++ bridge Installs a std::atomic<int> flag, wires it into whisper_full_params.abort_callback, and exposes a set_abort(int) C symbol so Go can flip the flag from a goroutine watching the request context. transcribe() now distinguishes abort (return 2) from real whisper_full failure (return 1). Assisted-by: Claude:claude-haiku-4-5 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(whisper): register set_abort symbol in the purego loader Adds the Go-side binding for the new C export so the next commit can call CppSetAbort(1) from a watcher goroutine on ctx.Done(). Assisted-by: Claude:claude-haiku-4-5 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(whisper): honor ctx cancellation and return codes.Canceled A watcher goroutine watches ctx.Done() during AudioTranscription and calls CppSetAbort(1) on cancel. whisper_full sees abort_callback return true at the next compute graph step, returns non-zero, and the bridge returns 2 -> AudioTranscription maps that to codes.Canceled. Adds an opt-in test (gated on WHISPER_MODEL_PATH / WHISPER_AUDIO_PATH) that asserts cancellation latency under 5s and proves the abort flag resets cleanly so the next transcription succeeds. Assisted-by: Claude:claude-sonnet-4-6 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(whisper): join the cancel watcher goroutine before returning Follow-up to `85edf9d2`. The previous commit used `defer close(done)` and called the watcher "joined synchronously" — but close() only signals, it does not block until the goroutine exits. That left a window where a late CppSetAbort(1) from a cancelled call could land on the next call, after its C-side g_abort reset but before whisper_full() began polling the abort callback, corrupting the second transcription. Switch to a sync.WaitGroup join so wg.Wait() blocks until the watcher has actually returned from its select. Assisted-by: Claude:claude-sonnet-4-6 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(whisper): short-circuit pre-cancelled ctx in AudioTranscription If ctx is already Done() at entry, return codes.Canceled immediately instead of running the full transcription. The C-side g_abort reset happens at the start of transcribe() and would otherwise overwrite a watcher-set abort flag from an already-cancelled ctx, producing a spurious successful transcription on a request the client has already abandoned. Assisted-by: Claude:claude-haiku-4-5 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(tests/distributed): update testLLM mock for new AudioTranscription signature Phase B (`93c48e19`) added context.Context to AIModel.AudioTranscription but missed the testLLM mock in tests/e2e/distributed. CI golangci-lint caught it: testLLM did not implement grpc.AIModel because the method signature lacked the ctx parameter, which broke the distributed test suite compilation and cascaded through every backend-build job that runs `go build ./...`. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> test(whisper): port cancellation test to Ginkgo/Gomega Project policy (.agents/coding-style.md, enforced by golangci-lint forbidigo) is that all Go tests must use Ginkgo v2 + Gomega — no stdlib testing patterns (t.Skip, t.Fatalf, etc.). Convert the cancellation test to a Describe/It block with Skip(...) for env gating and Expect/HaveOccurred for assertions. Same coverage: cancel mid-flight returns codes.Canceled within 5s and a follow-up transcription succeeds, proving the C-side g_abort flag resets cleanly. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-08 01:44:47 +02:00
LocalAI [bot]	595b6fd22d	feat(api/transcription): include segments + duration + language on stream done event (#9709 ) streamTranscription previously emitted a done event with just `text`, matching the OpenAI streaming spec exactly. Streaming clients that need per-utterance timings or audio duration had to fall back to the non-streaming JSON path — and that path is exactly the one that trips on ResponseHeaderTimeout when whisper requests queue behind each other on a SingleThread backend. Extend the done event to additively carry `language`, `duration`, and a `segments` array (id, start, end, text — start/end as float seconds, matching TranscriptionSegmentSeconds). Empty / zero values are still omitted; spec-compliant clients ignore the new fields. This unblocks notary's streaming Transcribe (companion change in the notary repo) so it produces the same TranscriptionResult shape as the JSON path while sidestepping the queue-induced header timeouts. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-07 17:28:26 +02:00
LocalAI [bot]	447c186089	fix(distributed): make backend upgrade actually re-install on workers (#9708 ) * fix(distributed): make backend upgrade actually re-install on workers UpgradeBackend dispatched a vanilla backend.install NATS event to every node hosting the backend. The worker's installBackend short-circuits on "already running for this (model, replica) slot" and returns the existing address — so the gallery install path was skipped, no artifact was re-downloaded, no metadata was written. The frontend's drift detection then re-flagged the same backends every cycle (installedDigest stays empty → mismatch → "Backend upgrade available (new build)") while "Backend upgraded successfully" landed in the logs at the same time. The user-visible symptom: clicking "Upgrade All" silently does nothing and the same N backends sit on the upgrade list forever. Two coupled fixes, one PR: 1. Force flag on backend.install. Add `Force bool` to BackendInstallRequest and thread it through NodeCommandSender -> RemoteUnloaderAdapter. UpgradeBackend (and the reconciler's pending-op drain when retrying an upgrade) sets force=true; routine load events and admin install endpoints keep force=false. On the worker, force=true stops every live process that uses this backend (resolveProcessKeys for peer replicas, plus the exact request processKey), skips the findBackend short-circuit, and passes force=true into gallery.InstallBackendFromGallery so the on-disk artifact is overwritten. After the gallery install completes, startBackend brings up a fresh process at the same processKey on a new port. 2. Liveness check on the fast path. installBackend's "already running" branch read getAddr without verifying the process was alive, so a gRPC backend that died without the supervisor noticing left a stale (key, addr) entry. The reconciler then dialed that address, got ECONNREFUSED, marked the replica failed, retried install — and the supervisor said "already running addr=…" again. Loop forever, exactly what we observed on a node whose llama-cpp process had died but whose supervisor record persisted. Verify s.isRunning(processKey) before trusting getAddr; if the entry is stale, stopBackendExact cleans up and we fall through to a real install. Backwards-compatible: the new Force field is omitempty, older workers ignore it (their default behavior matches force=false). The signature change on NodeCommandSender.InstallBackend is internal-only. Verified: unit tests in core/services/nodes pass (108s suite). The pre-existing core/backend build break (proto regen pending for word-level timestamps) blocks core/cli and core/http/endpoints/localai package tests but is unrelated to this change. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] * test(e2e/distributed): pass force=false to adapter.InstallBackend NodeCommandSender.InstallBackend gained a final force bool in the upgrade-force commit; the e2e distributed lifecycle tests still called the old 8-arg signature and broke compilation. These tests exercise the routine install path (single replica, default behavior), so force=false preserves their existing semantics. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-07 17:28:14 +02:00
LocalAI [bot]	cec5c4fdfc	fix(http): make handler-error status visible in access log + transcription errors (#9707 ) * fix(http): log accurate status code when handler returns error The custom xlog access-log middleware in API() reads res.Status before Echo's central HTTPErrorHandler runs, so when a handler returns an error without writing a response (e.g. TranscriptEndpoint's `return err` on backend failure) the status field stays at its default 200. The logged line then claims status=200 while the client receives 500 — silently hiding every 500/503/etc. that bubbles up through Echo's error handler. Mirror echo.DefaultHTTPErrorHandler's status derivation when err != nil and the response hasn't been committed: default to 500, upgrade to echo.HTTPError.Code if applicable. The logged status now matches what the client actually sees, so failed transcription requests stop appearing as 200 in the access log. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] fix(transcription): log underlying error before returning 500 to client ModelTranscriptionWithOptions surfaces real failures — gRPC errors from a remote node, model load problems, ffmpeg conversion crashes — but TranscriptEndpoint just did `return err`, so Echo turned it into a 500 with a generic body and the original error was lost. Operators chasing transcription failures across distributed mode were left with "upstream returned 500" on the client and zero context anywhere in the frontend's logs. Add an xlog.Error before returning, recording model name, the staged audio path, and the underlying error. Combined with the access-log status fix, a failing transcription now leaves an audit trail (real status code in the access line, real cause in an Error line) instead of vanishing. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-07 17:27:45 +02:00
Richard Palethorpe	969005b2a1	feat(gallery): Speed up load times and clean gallery entries (#9211 ) * feat: Rework VRAM estimation and use known_usecases in gallery Signed-off-by: Richard Palethorpe <io@richiejp.com> Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code] * chore(gallery): regenerate gallery index and add known_usecases to model entries Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-06 14:51:38 +02:00
Andreas Egli	af83518532	feat: support word-level timestamps for faster-whisper (#9621 ) Signed-off-by: Andreas Egli <github@kharan.ch> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2026-05-06 00:32:52 +02:00
Ettore Di Giacinto	e86ade54a6	feat(api): add /v1/audio/diarization endpoint with sherpa-onnx + vibevoice.cpp (#9654 ) * feat(api): add /v1/audio/diarization endpoint with sherpa-onnx + vibevoice.cpp Closes #1648. OpenAI-style multipart endpoint that returns "who spoke when". Single endpoint instead of the issue's three-endpoint sketch (refactor /vad, /vad/embedding, /diarization) — the typical client wants one call, and embeddings can land later as a sibling without breaking this surface. Response shape borrows from Pyannote/Deepgram: segments carry a normalised SPEAKER_NN id (zero-padded, stable across the response) plus the raw backend label, optional per-segment text when the backend bundles ASR, and a speakers summary in verbose_json. response_format also accepts rttm so consumers can pipe straight into pyannote.metrics / dscore. Backends: * vibevoice-cpp — Diarize() reuses the existing vv_capi_asr pass. vibevoice's ASR prompt asks the model to emit [{Start,End,Speaker,Content}] natively, so diarization is a by-product of the same pass; include_text=true preserves the transcript per segment, otherwise we drop it. * sherpa-onnx — wraps the upstream SherpaOnnxOfflineSpeakerDiarization C API (pyannote segmentation + speaker-embedding extractor + fast clustering). libsherpa-shim grew config builders, a SetClustering wrapper for per-call num_clusters/threshold overrides, and a segment_at accessor (purego can't read field arrays out of SherpaOnnxOfflineSpeakerDiarizationSegment[] directly). Plumbing: new Diarize gRPC RPC + DiarizeRequest / DiarizeSegment / DiarizeResponse messages, threaded through interface.go, base, server, client, embed. Default Base impl returns unimplemented. Capability surfaces all updated: FLAG_DIARIZATION usecase, FeatureAudioDiarization permission (default-on), RouteFeatureRegistry entries for /v1/audio/diarization and /audio/diarization, audio instruction-def description widened, CAP_DIARIZATION JS symbol, swagger regenerated, /api/instructions discovery map updated. Tests: * core/backend: speaker-label normalisation (first-seen → SPEAKER_NN, per-speaker totals, nil-safety, fallback to backend NumSpeakers when no segments). * core/http/endpoints/openai: RTTM rendering (file-id basename, negative duration clamping, fallback id). * tests/e2e: mock-backend grew a deterministic Diarize that emits raw labels "5","2","5" so the e2e suite verifies SPEAKER_NN remapping, verbose_json speakers summary + transcript pass-through (gated by include_text), RTTM bytes content-type, and rejection of unknown response_format. mock-diarize model config registered with known_usecases=[FLAG_DIARIZATION] to bypass the backend-name guard. Docs: new features/audio-diarization.md (request/response, RTTM example, sherpa-onnx + vibevoice setup), cross-link from audio-to-text.md, entry in whats-new.md. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] * fix(diarization): correct sherpa-onnx symbol name + lint cleanup CI failures on #9654: * sherpa-onnx-grpc-{tts,transcription} and sherpa-onnx-realtime panicked at backend startup with `undefined symbol: SherpaOnnxDestroyOfflineSpeakerDiarizationResult`. Upstream's actual symbol is SherpaOnnxOfflineSpeakerDiarizationDestroyResult (Destroy in the middle, not the prefix); the rest of the diarization surface follows the same naming pattern. The mismatched name made purego.RegisterLibFunc fail at dlopen time and crashed the gRPC server before the BeforeAll could probe Health, taking down every sherpa-onnx test job — not just the diarization-related ones. * golangci-lint flagged 5 errcheck violations on new defer cleanups (os.RemoveAll / Close / conn.Close); wrap each in a `defer func() { _ = X() }()` closure (matches the pattern other LocalAI files use for new code, since pre-existing bare defers are grandfathered in via new-from-merge-base). * golangci-lint also flagged forbidigo violations: the new diarization_test.go files used testing.T-style `t.Errorf` / `t.Fatalf`, which are forbidden by the project's coding-style policy (.agents/coding-style.md). Convert both files to Ginkgo/Gomega Describe/It with Expect(...) — they get picked up by the existing TestBackend / TestOpenAI suites, no new suite plumbing needed. * modernize linter: tightened the diarization segment loop to `for i := range int(numSegments)` (Go 1.22+ idiom). Verified locally: golangci-lint with new-from-merge-base=origin/master reports 0 issues across all touched packages, and the four mocked diarization e2e specs in tests/e2e/mock_backend_test.go still pass. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] * fix(vibevoice-cpp): convert non-WAV input via ffmpeg + raise ASR token budget Confirmed end-to-end against a real LocalAI instance with vibevoice-asr-q4_k loaded and the multi-speaker MP3 sample at vibevoice.cpp/samples/2p_argument.mp3: both /v1/audio/transcriptions and /v1/audio/diarization now succeed and return correctly attributed speaker turns for the full clip. Two latent issues surfaced once the diarization endpoint actually exercised the backend with a non-trivial input: 1. vv_capi_asr only accepts WAV via load_wav_24k_mono. The previous code passed the uploaded path straight through, so anything that wasn't already a 24 kHz mono s16le WAV failed at the C side with rc=-8 and the very unhelpful "vv_capi_asr failed". prepareWavInput shells out to ffmpeg ("-ar 24000 -ac 1 -acodec pcm_s16le") in a per-call temp dir, matching the rate the model was trained on; both AudioTranscription and Diarize now route through it. This is the same shape sherpa-onnx uses (utils.AudioToWav), but vibevoice needs 24 kHz rather than 16 kHz so we don't reuse that helper. 2. The C ABI's max_new_tokens defaults to 256 when 0 is passed. That's fine for a five-second clip but not for anything past ~10 s — vibevoice stops mid-JSON, the parse fails, and the caller sees a hard error. Pass a much larger budget (16 384 ≈ ~9 minutes of speech at the model's ~30 tok/s rate); generation stops at EOS so this is a cap rather than a target. 3. As a defensive belt-and-braces, mirror AudioTranscription's existing "fall back to a single segment if the model emits non-JSON text" pattern in Diarize, so partial / unusual model output never produces a 500. This kept the endpoint usable while diagnosing (1) and (2), and is the right behaviour to keep. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] * fix(vibevoice-cpp): pass valid WAVs through directly so ffmpeg is not required at runtime Spotted by tests-e2e-backend (1.25.x): the previous fix forced every incoming audio file through `ffmpeg -ar 24000 ...`, which meant the backend container — which does not ship ffmpeg — failed even for the existing happy path where the caller already uploads a WAV. The container-side error was: rpc error: code = Unknown desc = vibevoice-cpp: ffmpeg convert to 24k mono wav: exec: "ffmpeg": executable file not found in $PATH Reading vibevoice.cpp's audio_io.cpp, `load_wav_24k_mono` uses drwav and already accepts any PCM/IEEE-float WAV at any sample rate, downmixes multi-channel input to mono, and resamples to 24 kHz internally. So the only inputs that genuinely need an external converter are non-WAV formats (MP3, OGG, FLAC, ...). Detect WAVs by RIFF/WAVE magic at bytes 0..3 / 8..11 and pass them straight through with a no-op cleanup; everything else still goes through ffmpeg with the same 24 kHz mono s16le target. The result: * Container builds without ffmpeg keep working for WAV uploads (the e2e-backends fixture is jfk.wav at 16 kHz mono s16le). * MP3 and other non-WAV inputs still get the new ffmpeg conversion path so the diarization endpoint stays useful. * If the caller uploads a non-WAV but ffmpeg isn't on PATH, the surfaced error is still descriptive enough to act on. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] * fix(ci): make gcc-14 install in Dockerfile.golang best-effort for jammy bases The LocalVQE PR (`bb033b16`) made `gcc-14 g++-14` an unconditional apt install in backend/Dockerfile.golang and pointed update-alternatives at them. That works on the default `BASE_IMAGE=ubuntu:24.04` (noble has gcc-14 in main), but every Go backend that builds on `nvcr.io/nvidia/l4t-jetpack:r36.4.0` — jammy under the hood — now fails at the apt step: E: Unable to locate package gcc-14 This blocked unrelated jobs: backend-jobs(*-nvidia-l4t-arm64-{stablediffusion-ggml, sam3-cpp, whisper, acestep-cpp, qwen3-tts-cpp, vibevoice-cpp}). LocalVQE itself is only matrix-built on ubuntu:24.04 (CPU + Vulkan), so it doesn't actually need gcc-14 anywhere else. Make the gcc-14 install conditional on the package being available in the configured apt repos. On noble: identical behaviour to today (gcc-14 installed, update-alternatives points at it). On jammy: skip the gcc-14 stanza entirely and let build-essential's default gcc take over, which is what the other Go backends compile with anyway. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-05 15:10:13 +02:00
Richard Palethorpe	bb033b16a9	feat: add LocalVQE backend and audio transformations UI (#9640 ) feat(audio-transform): add LocalVQE backend, bidi gRPC RPC, Studio UI Introduce a generic "audio transform" capability for any audio-in / audio-out operation (echo cancellation, noise suppression, dereverberation, voice conversion, etc.) and ship LocalVQE as the first backend implementation. Backend protocol: - Two new gRPC RPCs in backend.proto: unary AudioTransform for batch and bidirectional AudioTransformStream for low-latency frame-by-frame use. This is the first bidi stream in the proto; per-frame unary at LocalVQE's 16 ms hop would be RTT-bound. Wire it through pkg/grpc/{client,server, embed,interface,base} with paired-channel ergonomics. LocalVQE backend (backend/go/localvqe/): - Go-Purego wrapper around upstream liblocalvqe.so. CMake builds the upstream shared lib + its libggml-cpu-.so runtime variants directly — no MODULE wrapper needed because LocalVQE handles CPU feature selection internally via GGML_BACKEND_DL. - Sets GGML_NTHREADS from opts.Threads (or runtime.NumCPU()-1) — without it LocalVQE runs single-threaded at ~1× realtime instead of the documented ~9.6×. - Reference-length policy: zero-pad short refs, truncate long ones (the trailing portion can't have leaked into a mic that wasn't recording). - Ginkgo test suite (9 always-on specs + 2 model-gated). HTTP layer: - POST /audio/transformations (alias /audio/transform): multipart batch endpoint, accepts audio + optional reference + params[]=v form fields. Persists inputs alongside the output in GeneratedContentDir/audio so the React UI history can replay past (audio, reference, output) triples. - GET /audio/transformations/stream: WebSocket bidi, 16 ms PCM frames (interleaved stereo mic+ref in, mono out). JSON session.update envelope for config; constants hoisted in core/schema/audio_transform.go. - ffmpeg-based input normalisation to 16 kHz mono s16 WAV via the existing utils.AudioToWav (with passthrough fast-path), so the user can upload any format / rate without seeing the model's strict 16 kHz constraint. - BackendTraceAudioTransform integration so /api/backend-traces and the Traces UI light up with audio_snippet base64 and timing. - Routes registered under routes/localai.go (LocalAI extension; OpenAI has no /audio/transformations endpoint), traced via TraceMiddleware. Auth + capability + importer: - FLAG_AUDIO_TRANSFORM (model_config.go), FeatureAudioTransform (default-on, in APIFeatures), three RouteFeatureRegistry rows. - localvqe added to knownPrefOnlyBackends with modality "audio-transform". - Gallery entry localvqe-v1-1.3m (sha256-pinned, hosted on huggingface.co/LocalAI-io/LocalVQE). React UI: - New /app/transform page surfaced via a dedicated "Enhance" sidebar section (sibling of Tools / Biometrics) — the page is enhancement, not generation, so it lives outside Studio. Two AudioInput components (Upload + Record tabs, drag-drop, mic capture). - Echo-test button: records mic while playing the loaded reference through the speakers — the mic naturally picks up speaker bleed, giving a real (mic, ref) pair for AEC testing without leaving the UI. - Reusable WaveformPlayer (canvas peaks + click-to-seek + audio controls) and useAudioPeaks hook (shared module-scoped AudioContext to avoid hitting browser context limits with three players on one page); migrated TTS, Sound, Traces audio blocks to use it. - Past runs saved in localStorage via useMediaHistory('audio-transform') — the history entry stores all three URLs so clicking re-renders the full triple, not just the output. Build + e2e: - 11 matrix entries removed from .github/workflows/backend.yml (CUDA, ROCm, SYCL, Metal, L4T): upstream supports only CPU + Vulkan, so we ship those two and let GPU-class hardware route through Vulkan in the gallery capabilities map. - tests-localvqe-grpc-transform job in test-extra.yml (gated on detect-changes.outputs.localvqe). - New audio_transform capability + 4 specs in tests/e2e-backends. - Playwright spec suite in core/http/react-ui/e2e/audio-transform.spec.js (8 specs covering tabs, file upload, multipart shape, history, errors). Docs: - New docs/content/features/audio-transform.md covering the (audio, reference) mental model, batch + WebSocket wire formats, LocalVQE param keys, and a YAML config example. Cross-links from text-to-audio and audio-to-text feature pages. Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit Write Agent TaskCreate] Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-04 22:07:11 +02:00
Ettore Di Giacinto	b1a99436c7	feat(branding): admin-configurable instance name, tagline, and assets (#9635 ) Adds a whitelabeling feature so an operator can replace the LocalAI instance name, tagline, square logo, horizontal logo, and favicon from the admin Settings page. Defaults fall back to the bundled assets so existing installs are unaffected. The public GET /api/branding endpoint is reachable pre-auth so the login screen can render the configured branding before sign-in. Mutating routes (POST/DELETE /api/branding/asset/:kind) remain admin-only. Text fields (instance_name, instance_tagline) ride the existing /api/settings flow; binary assets get a dedicated multipart upload route that persists files under DynamicConfigsDir/branding/. To prevent the Settings page's stale local state from clobbering an upload on save, UpdateSettingsEndpoint preserves whatever the on-disk asset filename fields are regardless of the body — /api/branding/asset/* are the sole writers for those fields. The MCP catalog gains get_branding and set_branding tools (text fields only; file upload stays UI-only) plus a configure_branding skill prompt. While wiring this up, the same restart-loss class of bug surfaced for several existing fields whose RuntimeSettings entries were never read by the startup loader. Fix loadRuntimeSettingsFromFile() to load: - branding (instance_name, instance_tagline, *_file basenames) - auto_upgrade_backends, prefer_development_backends - localai_assistant_enabled - open_responses_store_ttl - the 7 existing AgentPool fields (enabled, default/embedding model, chunking sizes, enable_logs, collection_db_path) Also exposes 3 new AgentPool runtime settings (vector_engine, database_url, agent_hub_url) via /api/settings + the Settings UI, with the same load-on-startup wiring. The file watcher's manual-edit path is intentionally not changed — the in-process API endpoints already update appConfig directly, so the watcher is redundant for supported flows and a separate refactor for everything else. 15 TDD specs cover the loader behaviour (1 branding + 11 adjacent + 3 new agent-pool); 2 specs cover the persistence helpers and the clobber-prevention contract. Assisted-by: claude-code:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-02 15:51:36 +02:00
Ettore Di Giacinto	bcef72b9c1	feat: localai assistant chat modality (#9602 ) * fix(tests): inline model_test fixtures after tests/models_fixtures removal The previous reorg removed tests/models_fixtures/ but core/config/model_test.go still read CONFIG_FILE/MODELS_PATH env vars pointing into that directory, so `make test` failed with "open : no such file or directory" on the readConfigFile spec (the suite ran with --fail-fast and bailed before openresponses_test). Inline the YAMLs (config/embeddings/grpc/rwkv/whisper) directly into the test file, materialise them into a per-test tmpdir via BeforeEach, and drop the env-var lookups. The test no longer depends on Makefile plumbing. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:claude-opus-4-7 [Edit] [Write] [Bash] * refactor(modeladmin): extract model-admin helpers into a service package Lift the bodies of EditModelEndpoint, PatchConfigEndpoint, ToggleStateModelEndpoint, TogglePinnedModelEndpoint and VRAMEstimateEndpoint into core/services/modeladmin so the same logic can be called by non-HTTP clients (notably the in-process MCP server that backs the LocalAI Assistant chat modality, landing in a follow-up commit). The HTTP handlers shrink to thin shells that parse echo inputs, call the matching helper, map typed errors (ErrNotFound, ErrConflict, ErrPathNotTrusted, ErrBadAction, ...) to the existing HTTP status codes, and render the existing response shapes. No REST-surface behaviour change; the existing localai endpoint tests cover the regression net. Adds focused unit tests for each helper against tmp-dir-backed ModelConfigLoader fixtures (deep-merge patch, rename + conflict, path separator guard, toggle/pin enable/disable, sync callback). Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(assistant): LocalAI Assistant chat modality with in-memory MCP server Adds a chat modality, admin-only, that wires the chat session to an in-memory MCP server exposing LocalAI's own admin/management surface as tools. An admin can install models, manage backends, edit configs and check status by chatting; the LLM calls tools like gallery_search, install_model, import_model_uri, list_installed_models, edit_model_config and surfaces the results. Same Go package powers two modes: pkg/mcp/localaitools/ NewServer(client, opts) builds an MCP server that registers the 19-tool admin catalog. The LocalAIClient interface has two impls: - inproc.Client — calls services directly (no HTTP loopback, no synthetic admin API key). Used in-process by the chat handler. - httpapi.Client — calls the LocalAI REST API. Used by the new `local-ai mcp-server --target=…` subcommand to control a remote LocalAI from a stdio MCP host. Tools and their embedded skill prompts are agnostic to which client backs them. Skill prompts are markdown files under prompts/, embedded via go:embed and assembled into the system prompt at server init. Wiring: - core/http/endpoints/mcp/localai_assistant.go — process-wide holder that spins up the in-memory MCP server once at Application start using paired net.Pipe transports, then reuses LocalToolExecutor (no fork) for every chat request that opts in. - core/http/endpoints/openai/chat.go — small branch ahead of the existing MCP block: when metadata.localai_assistant=true, defense-in-depth admin check + executor swap + system-prompt injection. All downstream tool dispatch is unchanged. - core/http/auth/{permissions,features}.go — adds FeatureLocalAIAssistant; gating happens at the chat handler entry plus admin-only `/api/settings`. - core/cli/{run.go,cli.go,mcp_server.go} — LOCALAI_DISABLE_ASSISTANT flag (runtime-toggleable via Settings, no restart), plus `local-ai mcp-server` stdio subcommand. - core/config/runtime_settings.go — `localai_assistant_enabled` runtime setting; the chat handler reads `DisableLocalAIAssistant` live at request entry. UI: - Home.jsx — prominent self-explanatory CTA card on first run ("Manage LocalAI by chatting"); collapses to a compact "Manage by chat" button in the quick-links row once used, persisted via localStorage. - Chat.jsx — admin-only "Manage" toggle in the chat header, "Manage mode" badge, dedicated empty-state copy, starter chips. - Settings.jsx — "LocalAI Assistant" section with the runtime enable toggle. - useChat.js — `localaiAssistant` flag on the chat schema; injects `metadata.localai_assistant=true` on requests when active. Distributed mode: the in-memory MCP server lives only on the head node; inproc.Client wraps already-distributed-aware services so installs propagate to workers via the existing GalleryService machinery. Documentation: `.agents/localai-assistant-mcp.md` is the contributor contract — when adding an admin REST endpoint, also add a LocalAIClient method, an inproc + httpapi impl, a tool registration, and a skill prompt update; the AGENTS.md index links to it. Out of scope (follow-ups): per-tool RBAC granularity for non-admin read-only access; streaming mcp_tool_progress for long installs; React Vitest rig for the UI changes. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(assistant): extract tool/capability/MiB/server-name constants The MCP tool surface, capability tag set, server-name default, and the chat-handler metadata key were repeated as bare string literals across seven files. Renaming any one required hand-editing every call site and risked code/test/prompt drift. This pulls them into typed constants: - pkg/mcp/localaitools/tools.go — Tool* constants for the 19 MCP tools, plus DefaultServerName. - pkg/mcp/localaitools/capability.go — typed Capability + constants for the capability tag set the LLM passes to list_installed_models. The type rides through LocalAIClient.ListInstalledModels and replaces the triplet of "embed"/"embedding"/"embeddings" with the single CapabilityEmbeddings. - pkg/mcp/localaitools/inproc/client.go — bytesPerMiB constant for the VRAMEstimate byte→MB conversion. - core/http/endpoints/mcp/tools.go — MetadataKeyLocalAIAssistant for the "localai_assistant" request-metadata key consumed by the chat handler. Tool registrations, the test catalog, the dispatch table, the validation fixtures, and the fake/stub clients all reference the constants. The embedded skill prompts under prompts/ keep their bare strings (go:embed markdown can't import Go constants); the existing TestPromptsContain SafetyAnchors guards the alignment. No behaviour change. All tests pass with -race. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(modeladmin): typed Action for ToggleState/TogglePinned The toggle/pin verbs were bare strings everywhere — handler signatures, service implementations, MCP tool args, the fake/stub clients, the inproc and httpapi LocalAIClient impls, plus 4 test files. A typo in any caller silently fell through to the runtime "must be 'enable' or 'disable'" check. Introduce core/services/modeladmin.Action (string alias) with ActionEnable, ActionDisable, ActionPin, ActionUnpin and a small Valid helper. The compiler now catches mismatches at every boundary; renames ripple through one source of truth. LocalAIClient.ToggleModelState/Pinned signatures change to take modeladmin.Action. The package is brand-new and unreleased so this is a free public-API tightening. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(assistant): respect ctx cancellation on gallery channel sends InstallModel, DeleteModel, ImportModelURI, InstallBackend and UpgradeBackend all pushed onto galleryop channels with bare sends. If the worker was paused or the buffer full, the chat-handler goroutine blocked forever — the LLM kept polling and the request leaked. Wrap the five sends in a sendModelOp/sendBackendOp helper that selects on ctx.Done() so a cancelled chat completion surfaces context.Canceled back to the LLM instead of hanging. Adds inproc/client_test.go with a pre-cancelled-ctx regression test on InstallModel; the helpers are shared so the same guarantee covers the other four call sites. Assisted-by: Claude:claude-opus-4-7 [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(assistant): graceful shutdown for in-memory holder and stdio CLI Two related leaks: - Application.start() built the LocalAIAssistantHolder but never wired Close() into the graceful-termination chain — the in-memory MCP transport pair stayed alive until process exit, and the goroutines behind net.Pipe() didn't drain. Hook into the existing signals.RegisterGracefulTerminationHandler chain (same pattern as core/http/endpoints/mcp/tools.go:770). - core/cli/mcp_server.go ran srv.Run with context.Background(); a Ctrl-C from the host (Claude Desktop, mcphost, npx inspector) or a SIGTERM from process supervision left the stdio loop reading from a closed pipe. Switch to signal.NotifyContext to surface the signal through ctx and let srv.Run drain. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(assistant): typed HTTPError + propagate prompt walk error The httpapi client detected "no such job" by substring-matching on the error string ("404", "could not find") — brittle to status-code formatting changes and to LocalAI fixing /models/jobs/:uuid to return a proper 404. Replace with a typed HTTPError whose Is() method honours errors.Is(err, ErrHTTPNotFound). The 500-with-"could not find" branch stays as a transitional fallback documented in Is(). Same change covers ListNodes' 404 fallback for the /api/nodes endpoint. Adds httptest tests for both 404 and the legacy 500 path, plus a direct errors.Is exposure test so external callers (the standalone stdio CLI host) can match without re-string-parsing. Also tightens prompts.SystemPrompt: panic when fs.WalkDir on the embedded FS fails. The only realistic cause is a build-time //go:embed misconfiguration; serving an empty system prompt to the LLM is much worse than crashing init. TestSystemPromptIncludesAllEmbeddedFiles catches regressions in CI. Assisted-by: Claude:claude-opus-4-7 [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> fix(modeladmin): atomic writes for model config files The five sites that wrote model YAML used os.WriteFile, which opens with O_TRUNC\|O_WRONLY\|O_CREATE. A crash mid-write left the destination truncated and the model unloadable until manual repair. Pre-existing behaviour inherited from the original endpoint handlers — fix once now that there's a single helper. Adds writeFileAtomic: writes to a sibling temp file, chmods, syncs via Close(), then os.Rename. Same-directory temp keeps the rename atomic on the same filesystem; cleanup runs on every error path so stray temps don't accumulate. No new dependency. Applied to: - ConfigService.PatchConfig - ConfigService.EditYAML (both rename and in-place branches) - mutateYAMLBoolFlag (drives ToggleState + TogglePinned) atomic_test.go covers the happy path plus a read-only-dir failure case that asserts the original file is preserved (skipped on Windows where the chmod trick is POSIX-specific). Assisted-by: Claude:claude-opus-4-7 [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(assistant): prune dead code, mark stub, document conventions Three small cleanups landing together: - Drop the unused errNotImplemented sentinel from inproc/client.go. All five methods that used to return it are wired to modeladmin helpers since the Phase B commit; the package var is dead. - Annotate httpapi.Client.GetModelConfig as a known stub. LocalAI's /models/edit/:name returns rendered HTML, not JSON, so the standalone CLI's get_model_config tool surfaces a clear error to the LLM. A future JSON-only /api/models/config-yaml/:name endpoint is tracked in the agent contract; FIXME points at it. - Extend `.agents/localai-assistant-mcp.md` with a "Code conventions" section that documents the audit-driven rules: tool/Capability/Action constants, errors.Is over substring matching, ctx-aware channel sends, atomic writes, and graceful shutdown. Refresh the file map so it lists tools.go and capability.go and drops the removed tools_bootstrap.go. The tools_models.go diff is a comment-only change explaining why the ModelName empty-string check stays at the tool layer (consistency across LocalAIClient implementations, since the SDK schema validator only enforces presence, not non-empty). Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(assistant): convert test files to ginkgo + gomega The repo convention (per core/http/endpoints/localai/_test.go, core/gallery/, etc.) is Ginkgo v2 with Gomega assertions. The tests I introduced for the assistant feature used vanilla testing.T, which made them stand out and stripped the BDD structure the rest of the suite relies on. Convert every test file in the assistant scope to Ginkgo: pkg/mcp/localaitools/ dto_test.go — Describe("DTOs round-trip through JSON") prompts_test.go — Describe("SystemPrompt assembler") server_test.go — Describe("Server tool catalog"), Describe("Tool dispatch"), Describe("Tool error surfacing"), Describe("Argument validation"), Describe("Concurrent tool calls") parity_test.go — Describe("LocalAIClient parity"), hosts the suite's single RunSpecs (the file is package localaitools_test so it can import httpapi without an import cycle; Ginkgo aggregates Describes from both the internal and external test packages into one run). httpapi/client_test.go — Describe("httpapi.Client against the LocalAI admin REST surface"), Describe("ErrHTTPNotFound"), Describe("Bearer token") inproc/client_test.go — Describe("inproc.Client cancellation") core/services/modeladmin/ config_test.go — Describe("ConfigService") with sub-Describes for GetConfig, PatchConfig, EditYAML state_test.go — Describe("ConfigService.ToggleState") pinned_test.go — Describe("ConfigService.TogglePinned") atomic_test.go — Describe("writeFileAtomic") core/http/endpoints/mcp/ localai_assistant_test.go — Describe("LocalAIAssistantHolder") Each package gets a `_suite_test.go` with the standard `RegisterFailHandler(Fail) + RunSpecs(t, "...")` boilerplate. Helpers that previously took testing.T (newTestService, writeModelYAML, readMap, sortedStrings, sortGalleries, etc.) drop the T receiver and use Gomega Expectations directly. tmp dirs come from GinkgoT().TempDir(). No semantic change to test coverage — every original assertion has a direct Gomega counterpart. All suites pass with -race. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test+docs(assistant): drift detector for Tool ↔ REST route mapping Honest gap from the audit: the parity_test.go suite only checks four methods, and uses the same httpapi.Client for both sides — it asserts stability of the DTO shapes, not equivalence between in-process and HTTP. If a contributor adds an admin REST endpoint without an MCP tool, or a tool without a matching httpapi route, both surfaces silently diverge. Add a coverage test plus stronger docs: - pkg/mcp/localaitools/coverage_test.go introduces a hand-maintained toolToHTTPRoute map: every Tool* constant must list the REST endpoint the httpapi.Client hits (or "(none)" with a documented reason). Two Ginkgo specs assert the map and the published catalog stay in sync — one fails when a Tool is added without a route entry, the other fails when a route entry references a tool that no longer exists. Verified by removing the ToolDeleteModel entry locally; the test fired with a clear message pointing the contributor at the file. Deliberate non-test: we don't enumerate live admin REST routes from here. Walking the route registry requires booting Application; parsing core/http/routes/localai.go is brittle. The "new admin REST endpoint → MCP tool" direction stays a PR checklist item — see below. - AGENTS.md gets a new Quick Reference bullet that calls out the rule and points at the test by name. - .agents/api-endpoints-and-auth.md tightens the existing "Companion: MCP admin tool surface" subsection from "if useful, consider..." to "MUST be considered, with three concrete outcomes (tool added, deliberately skipped with documented reason, or forgot — which breaks the contract)". Adds a checklist item at the bottom of the file's authoritative checklist. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(assistant): drop duplicate DTOs, surface canonical types Audit feedback: localaitools/dto.go reinvented several types that already existed in the codebase. Replace the duplicates with the canonical types so the LLM-visible wire format stays aligned with the rest of LocalAI by construction (no parallel structs to keep in sync). Removed (and the canonical type now used by the LocalAIClient interface): localaitools.Gallery → config.Gallery localaitools.GalleryModelHit → gallery.Metadata localaitools.VRAMEstimate → vram.EstimateResult Tightened scope: localaitools.Backend → kept, but reduced to {Name, Installed}. ListKnownBackends now returns []schema.KnownBackend (the canonical type already used by REST /backends/known). Kept with documented rationale: localaitools.JobStatus — galleryop.OpStatus has Error error which marshals to "{}". JobStatus is the JSON-friendly mirror. localaitools.Node — nodes.BackendNode carries gorm internals + token hash; we expose only the LLM-relevant fields. ImportModelURIRequest/Response — schema.ImportModelRequest and GalleryResponse are wire-shaped, mine are LLM-shaped (BackendPreference flat, AmbiguousBackend exposed). Side wins: - Drop bytesPerMiB; vram.EstimateResult already carries human-readable display strings (size_display, vram_display) the LLM uses directly. - Drop the handler-private vramEstimateRequest in core/http/endpoints/localai/vram.go and bind directly into modeladmin.VRAMRequest (now JSON-tagged). Both clients pass through these types now where possible (e.g. ListGalleries in inproc.Client is a one-liner returning AppConfig.Galleries; httpapi.Client.GallerySearch decodes straight into []gallery.Metadata). All tests green with -race. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(assistant): extract REST route paths into named constants httpapi.Client had 18 bare-string path sites scattered across methods. Pull them into pkg/mcp/localaitools/httpapi/routes.go: static paths as package-private constants, dynamic paths as small builders that handle url.PathEscape on segment values. No behaviour change. Drops the now-unused net/url import from client.go since path escaping moved into routes.go alongside the path it applies to. Local-only by design: the server-side registrations in core/http/routes/localai.go remain bare strings. Sharing constants across the pkg/ ↔ core/ boundary would invert the layering today; the existing Tool↔REST drift-detector in coverage_test.go is the safety net for that direction. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] * docs(assistant): align with shipped UI and dropped bootstrap env vars The LocalAI Assistant doc still described the older iteration: - The in-chat toggle was renamed from "Admin" to "Manage" (the badge is now "Manage mode" and the home page exposes a "Manage by chat" CTA). - LOCALAI_ASSISTANT_BOOTSTRAP_MODEL / --localai-assistant-bootstrap-model and the bootstrap_default_model tool were removed — admins pick a model from the existing selector instead, no env-var configuration required. - The shipped tool catalog includes import_model_uri but didn't appear in the doc; bootstrap_default_model appeared but no longer exists. - The Settings → LocalAI Assistant runtime toggle wasn't mentioned as the preferred way to disable without restart. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-28 19:29:27 +02:00
Ettore Di Giacinto	ea1df8945b	fix(distributed): preserve UI-added node labels across worker re-register The register endpoint called SetNodeLabels(req.Labels) — replace-all semantics — so every worker re-register wiped every label not in the worker's body. The bug existed since labels were introduced in PR #9186 (Mar 31), but only triggered for workers that supplied labels via --node-labels. PR #9583 (the multi-replica refactor) added an auto-mirrored `node.replica-slots` label to every worker's registration body, which made `len(req.Labels) > 0` always true — turning a latent edge-case bug into a universal one. Operators reported "labels assigned to node do not persist": labels survived until the next worker restart, then disappeared. Fix: iterate req.Labels and call SetNodeLabel (upsert) for each instead of SetNodeLabels (delete-then-recreate). Worker-managed labels still refresh on re-register; UI-added labels survive. Trade-off: an operator who removes a label from --node-labels won't have it auto-removed from the DB on next register — they can clean it via the UI. Acceptable, since the alternative (current behavior) silently destroys operator state. Regression test added first (TDD): RegisterNodeEndpoint registers a node, the test simulates a UI add via SetNodeLabel, then re-registers with a different worker label set; assertion that the UI-added label survives. Test fails against the broken code, passes against the fix. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [Bash]	2026-04-27 21:24:50 +00:00
Ettore Di Giacinto	6b63b47f61	feat(distributed): support multiple replicas of one model on the same node (#9583 ) * feat(distributed): support multiple replicas of one model on the same node The distributed scheduler implicitly assumed `(node_id, model_name)` was unique, but the schema didn't enforce it and the worker keyed all gRPC processes by model name alone. With `MinReplicas=2` against a single worker, the reconciler "scaled up" every 30s but the registry never advanced past 1 row — the worker re-loaded the model in-place every tick until VRAM fragmented and the gRPC process died. This change introduces multi-replica-per-node as a first-class concept, with capacity-aware scheduling, a circuit breaker, and VRAM soft-reservation. Operators can declare per-node capacity via the worker flag `--max-replicas-per-model` (mirrored as auto-label `node.replica-slots=N`) or override per-node from the UI. * Schema: BackendNode gains MaxReplicasPerModel (default 1) and ReservedVRAM. NodeModel gains ReplicaIndex (composite with node_id + model_name). ModelSchedulingConfig gains UnsatisfiableUntil/Ticks for the reconciler circuit breaker. * Registry: replica_index threaded through SetNodeModel, RemoveNodeModel, IncrementInFlight, DecrementInFlight, TouchNodeModel, GetNodeModel, SetNodeModelLoadInfo and the InFlightTrackingClient. New helpers: CountReplicasOnNode, NextFreeReplicaIndex (with ErrNoFreeSlot), RemoveAllNodeModelReplicas, FindNodesWithFreeSlot, ClusterCapacityForModel, ReserveVRAM/ReleaseVRAM (atomic UPDATE with ErrInsufficientVRAM), and the unsatisfiable-flag CRUD. * Worker: processKey now `<modelID>#<replicaIndex>` so concurrent loads of the same model land on distinct ports. Adds CLI flag --max-replicas-per-model (env LOCALAI_MAX_REPLICAS_PER_MODEL, default 1) and emits the auto-label. * Router: scheduleNewModel filters candidates by free slot, allocates the replica index, and soft-reserves VRAM before installing the backend. evictLRUAndFreeNode now deletes the targeted row by ID instead of all replicas of the model on the node — fixes a latent bug where evicting one replica orphaned its siblings. * Reconciler: caps scale-up at ClusterCapacityForModel so a misconfig (MinReplicas > capacity) doesn't loop forever. After 3 consecutive ticks of capacity==0 it sets UnsatisfiableUntil for a 5m cooldown and emits a warning. ClearAllUnsatisfiable fires from Register, ApproveNode, SetNodeLabel(s), RemoveNodeLabel and UpdateMaxReplicasPerModel so a new node joining or label changes wake the reconciler immediately. scaleDownIdle removes highest-replica-index first to keep slots compact. * Heartbeat resets reserved_vram to 0 — worker is the source of truth for actual free VRAM; the reservation is only for the in-tick race window between two scheduling decisions. * Probe path (reconciler.probeLoadedModels and health.doCheckAll) now pass the row's replica_index to RemoveNodeModel so an unreachable replica doesn't orphan healthy siblings. * Admin override: PUT /api/nodes/:id/max-replicas-per-model sets a sticky override (preserved across worker re-registration). DELETE clears the override so the worker's flag applies again on next register. Required because Kong defaults the worker flag to 1, so every worker restart would have silently reverted the UI value. * React UI: always-visible slot badge on the node row (muted at default 1, accented when >1); inline editor in the expanded drawer with pencil-to-edit, Save/Cancel, Esc/Enter, "(override)" indicator when the value is admin-set, and a "Reset" button to hand control back to the worker. Soft confirm when shrinking the cap below the count of loaded replicas. Scheduling rules table gets an "Unsatisfiable until HH:MM" status badge surfacing the cooldown. * node.replica-slots filtered out of the labels strip on the row to avoid duplicating the slot badge. 23 new Ginkgo specs (registry, reconciler, inflight, health) cover: multi-replica row independence, RemoveNodeModel of one replica preserving siblings, NextFreeReplicaIndex slot allocation including ErrNoFreeSlot, capacity-gated scale-up with circuit breaker tripping and recovery on Register, scheduleDownIdle ordering, ClusterCapacity math, ReserveVRAM admission gating, Heartbeat reset, override survival across worker re-registration, and ResetMaxReplicasPerModel handing control back. Plus 8 stdlib tests for the worker processKey / CLI / auto-label. Closes the flap reproduced on Qwen3.6-35B against the nvidia-thor worker (single 128 GiB node, MinReplicas=2): the reconciler now caps the scale-up at the cluster's actual capacity instead of looping. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Read] [Edit] [Bash] [Skill:critique] [Skill:audit] [Skill:polish] [Skill:golang-testing] * refactor(react-ui/nodes): tighten capacity editor copy + adopt ActionMenu for row actions * Capacity editor hint trimmed from operator-doc-style ("Sourced from the worker's `--max-replicas-per-model` flag. Changing it here makes it a sticky admin override that survives worker restarts." → "Saved values stick across worker restarts.") and the override-state copy similarly compressed. The full mechanic is no longer needed in the UI — the override pill carries the meaning and the docs cover the rest. * Node row actions migrated from an inline cluster of icon buttons (Drain / Resume / Trash) to the kebab ActionMenu used by /manage for per-row model actions, so dense Nodes tables stay clean. Approve stays as a prominent primary button — it's a stateful admission gate, not a routine action, and elevating it matches how /manage surfaces install-time decisions outside the menu. * The expanded drawer's Labels section now filters node.replica-slots out of the editable label list. The label is owned by the Capacity editor above; surfacing it again as an editable label invited confusion (the Capacity save would clobber any direct edit). Both backend and agent workers benefit — they share the row rendering path, so the action menu and label filter apply to both. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp] [Skill:critique] [Skill:audit] [Skill:polish] * fix(react-ui/nodes): suppress slot badge on agent workers Agent workers don't load models, so the per-node replica capacity is inapplicable to them. Showing "1× slots" on agent rows was a tiny inconsistency from the unified rendering path — gate the badge on node_type !== 'agent' so it only appears on backend workers. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp] * refactor(react-ui/nodes): distill expanded drawer + restyle scheduling form The expanded node drawer used to stack five panels — slot badge, filled capacity box, Loaded Models h4+empty-state, Installed Backends h4+empty-state, Labels h4+chips+form — making routine inspections feel like a control panel. The scheduling rule form wrapped its mode toggle as two 50%-width filled buttons that competed visually with the actual primary action. * Drawer: collapse three rarely-touched config zones (Capacity, Backends, Labels) into one `<details>` "Manage" disclosure (closed by default) with small uppercase eyebrow labels for each zone instead of parallel h4 sub-headings. Loaded Models stays as the at-a-glance headline with a single-line empty hint instead of a boxed empty state. CapacityEditor renders flat (no filled background) — the Manage disclosure provides framing. * Scheduling form: replace the chunky 50%-width button-tabs with the project's existing `.segmented` control (icon + label, sized to content). Mode hint becomes a single tied line below. Fields stack vertically with helper text under inputs and a hairline divider above the right-aligned Save / Cancel. The empty drawer collapses from ~5 stacked sections (~280px tall) to two lines (~80px). The scheduling form now reads as a designed dialog instead of raw building blocks. Both surfaces now match the typographic density and weight of the rest of the admin pages. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp] [Skill:distill] [Skill:audit] [Skill:polish] * feat(react-ui/nodes): replace scheduling form's model picker with searchable combobox The native <select> made operators scroll through every gallery entry to find a model name. The project already has SearchableModelSelect (used in Studio/Talk/etc.) which combines free-text search with the gallery list and accepts typed model names that aren't installed yet — useful for pre-staging a scheduling rule before the node it'll run on has finished bootstrapping. Also drops the now-unused useModels import (the combobox manages the gallery hook internally). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] * refactor(react-ui/nodes): consolidate key/value chip editor + add replica preset chips The Nodes page was rendering the same key=value chip pattern in two places with subtly different markup: the Labels editor in the expanded drawer and (post-distill) the Node Selector input in the scheduling form. The form's input was also a comma-separated string that operators were getting wrong. * Extract <KeyValueChips> as a fully controlled chip-builder. Parent owns the map and decides what onAdd/onRemove does — form state for the scheduling form, API calls for the live drawer Labels editor. Same visuals everywhere; one component to change when polish needs apply. * Replace the comma-separated Node Selector text input with KeyValueChips. Operators were copying syntax from docs and missing commas; the chip vocabulary makes the key=value structure self-documenting. * Add <ReplicaInput>: numeric input + quick-pick preset chips for Min/Max replicas. Picked over a slider because replica counts are exact specs derived from VRAM math (operator decision, not a fuzzy estimate). The chips give one-click access to common values (1/2/3/4 for Min, 0=no-limit/2/4/8 for Max) without the slider's special-value problem (MaxReplicas=0 is categorical, not a position on a continuum). * Drop the now-unused labelInputs state in the Nodes page (the inline label editor's per-node draft state lived there and is now owned by KeyValueChips). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [Skill:distill] * test: fix CI fallout from multi-replica refactor (e2e/distributed + playwright) Two breakages caught by CI that didn't surface in the local run: * tests/e2e/distributed/.go — multiple files used the pre-PR2 registry signatures for SetNodeModel / IncrementInFlight / DecrementInFlight / RemoveNodeModel / TouchNodeModel / GetNodeModel / SetNodeModelLoadInfo and one stale adapter.InstallBackend call in node_lifecycle_test.go. All updated to pass replicaIndex=0 — these tests don't exercise multi-replica behavior, they just need to compile against the new signatures. The chip-builder tests in core/services/nodes/ already cover the multi-replica logic. core/http/react-ui/e2e/nodes-per-node-backend-actions.spec.js — the drawer's distill refactor moved Backends inside a "Manage" <details> disclosure that's collapsed by default. The test helper expanded the node row but never opened Manage, so the per-node backend table was never in the DOM. Helper now clicks `.node-manage > summary` after expanding the row. All 100 playwright tests pass locally; tests/e2e/distributed compiles clean. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [Bash] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-27 21:20:05 +02:00
Ettore Di Giacinto	2da1a4d230	feat(distributed): per-node backend installation from the gallery In distributed mode the Backends gallery used to fan every install out to every worker — fine for auto-resolving (meta) backends like llama-cpp where each node picks its own variant, but wrong for hardware-specific builds like cpu-llama-cpp that would silently land on every GPU node. Adds a node-targeted install path through the existing POST /api/nodes/:id/backends/install plumbing, with two entry points: - Backends gallery row gets a split-button in distributed mode. Auto- resolving keeps "Install on all nodes" as the primary; chevron menu opens the picker. Hardware-specific routes the primary directly to the picker — no fan-out path on the row. - Nodes-page drawer gets a "+ Add backend" button that navigates to /app/backends?target=<node-id>; the gallery scopes itself to that node (banner, single per-row install button, Reinstall/Remove for already- installed). One gallery, two scopes — no second UI to maintain. The picker (new NodeInstallPicker) shows a 3-state suitability column (Compatible / Override / Installed), an auto-expanding variant override disclosure that fires when selected nodes have no working GPU, parallel per-node installs with inline status and Retry-failed-nodes, and a mismatch confirm that names the consequence on the button itself. A 409 fan-out guard on /api/backends/apply protects CLI/Terraform/script users from the same footgun: hardware-specific installs in distributed mode now return code "concrete_backend_requires_target" with a human- readable error and a meta_alternative pointer. The gallery list payload now surfaces capabilities, metaBackendFor and per-row nodes (NodeBackendRef) so the picker and the new Nodes column have everything they need without re-walking the gallery client-side. GODEBUG=netdns=go is set on the compose services because the cgo DNS resolver follows the container's nsswitch.conf to host systemd-resolved (127.0.0.53), unreachable from inside the container; the pure-Go resolver reads /etc/resolv.conf directly and uses Docker's embedded DNS. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude Code:claude-opus-4-7[1m] [Edit] [Bash] [Read] [Write]	2026-04-26 22:05:18 +00:00
Richard Palethorpe	3db60b57e6	fix(realtime): consume ChatDeltas when C++ autoparser clears Response (#9538 ) The llama.cpp C++-side chat autoparser clears Reply.Message and delivers parsed content/reasoning/tool-calls via Reply.chat_deltas. chat.go handles this (non-SSE path uses ToolCallsFromChatDeltas/ContentFromChatDeltas/ ReasoningFromChatDeltas), but realtime.go only read pred.Response, so any model routed through the autoparser (Qwen2.5/3 and friends) produced a silent reply: backend emitted N tokens, the session surface saw zero. Mirror the non-SSE chat path in realtime's triggerResponse: when deltas carry tool calls or content, use them directly; otherwise fall back to the existing raw-text parsing. Assisted-by: claude-opus-4-7-1M [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-04-24 14:41:38 +02:00

1 2 3 4 5 ...

310 Commits