Commit Graph

582 Commits

Author SHA1 Message Date
Ettore Di Giacinto
d05d83ff36 feat(realtime): stream tool-call turns via tokenizer-template autoparser
Per review (richiejp): tool-call deltas exist, so streaming should work with
tools too. It does — for models that use their tokenizer template. The C++
autoparser then clears reply.Message and delivers content + tool calls via
ChatDeltas, so the streamed transcript carries only spoken content (no
tool-call JSON leak) and the tool calls are parsed from the final response.

- Drop the len(tools)==0 gate; stream when no tools OR use_tokenizer_template
  (grammar-based function calling still buffers, since its call is emitted as
  JSON in the token stream and would leak into the transcript).
- streamLLMResponse takes tools/toolChoice/toolTurn, reads ChatDelta content
  in the token callback, parses tool calls from the final ChatDeltas, and
  creates the assistant content item lazily so a content-less tool turn emits
  only the tool calls.
- Extract emitToolCallItems from the buffered path so both paths finalize tool
  calls, response.done, and server-side assistant-tool follow-ups identically.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-05 14:03:36 +00:00
Ettore Di Giacinto
076dcdbed8 refactor(realtime): buffer whole message for TTS, drop sentence segmenter
Per review (richiejp): the sentence segmenter pipelined unary TTS by
splitting on ASCII .!?/newline, which does nothing for languages without
those boundaries (CJK/Thai) — there it already degraded to buffering the
whole message anyway.

Replace it with a uniform model: stream the LLM transcript live, buffer the
full message, then synthesize it once. emitSpeech already streams the audio
chunks when the backend implements TTSStream and falls back to a single
unary delta otherwise, so this is real streaming TTS where supported and a
clean whole-message synthesis elsewhere — no per-sentence emulation, no
language assumptions. speechStreamer becomes transcriptStreamer (transcript
deltas only); the whole-message synthesis moves into streamLLMResponse.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-05 14:03:36 +00:00
Ettore Di Giacinto
9ec1456ec6 fix(realtime): clean TTS temp path before read (gosec G304)
emitSpeech reads the WAV file the TTS backend wrote. The read moved here
from realtime.go, so code-scanning flagged it as a new G304 alert even
though the path is backend-controlled (a temp file), not user input.
Wrap it in filepath.Clean — a real path normalization that also clears
the alert, keeping with the repo's no-#nosec convention.

Assisted-by: Claude:claude-opus-4-8 gosec, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-05 14:03:36 +00:00
Ettore Di Giacinto
cb3609530a fix(realtime): always strip reasoning from spoken output
disable_thinking maps to ReasoningConfig.DisableReasoning=true on the LLM
config, which the backend reads as enable_thinking=false. But the realtime
handler reads that SAME config to drive reasoning extraction, and there
DisableReasoning=true means "skip stripping". PredictConfig() returns this
LLM config, so both the streamed (speechStreamer) and buffered realtime
paths stopped stripping <think>…</think> exactly when disable_thinking was
on — leaking raw reasoning to the client whenever the model ignored the
enable_thinking hint (e.g. lfm2.5).

Add spokenReasoningConfig() which clears DisableReasoning for extraction
(keeping custom tokens/tag pairs) and route both realtime paths through it.
Spoken output now always strips reasoning, independent of the backend
suppression hint.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-05 14:03:36 +00:00
Ettore Di Giacinto
f48344f2ff fix(realtime): register pipeline streaming/thinking config fields
TestAllFieldsHaveRegistryEntries (core/config/meta) requires every config
field to have a meta registry entry. The four new pipeline fields
(disable_thinking, streaming.{llm,tts,transcription}) had none, failing
tests-linux/tests-apple. Add toggle entries for them.

Also handle the os.Remove return in realtime_speech_test.go to satisfy
errcheck (golangci-lint).

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-05 14:03:36 +00:00
Ettore Di Giacinto
16a5bab71f feat(realtime): wire streamLLMResponse for token-streamed replies
triggerResponseAtTurn takes a streamed path when pipeline.streaming.llm is
set, the turn has no tools, and audio is requested: streamLLMResponse
announces the assistant item, drives the LLM token callback through a
speechStreamer (reasoning-stripped transcript deltas + sentence-piped TTS),
and emits the terminal events. Tool turns and non-streaming pipelines keep
the existing buffered path unchanged, so this is strictly opt-in.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-05 14:03:36 +00:00
Ettore Di Giacinto
ca23d05c66 feat(realtime): speechStreamer for token-streamed LLM->TTS
emitSpeech now returns raw PCM (caller base64-encodes) so streamed segments
accumulate correctly. speechStreamer consumes streamed LLM tokens: it strips
reasoning via the streaming ReasoningExtractor, emits a transcript delta per
content fragment, and sentence-pipes content into emitSpeech so each sentence
is synthesized as soon as it's ready. Handler wiring (plain-content turns)
follows.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-05 14:03:36 +00:00
Ettore Di Giacinto
685e4632d7 feat(realtime): pipeline disable_thinking maps to enable_thinking off
applyPipelineThinking forces the LLM's ReasoningConfig.DisableReasoning when
pipeline.disable_thinking is set, which gRPCPredictOpts turns into the
enable_thinking=false backend metadata. Applied at newModel construction on
the per-session LLM config copy, so it doesn't leak to other model users and
needs no realtime-specific request plumbing.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-05 14:03:36 +00:00
Ettore Di Giacinto
98ed541b22 feat(realtime): streaming transcription text deltas
Add emitTranscription and route commitUtterance through it. With
pipeline.streaming.transcription set it streams each transcript fragment as
a conversation.item.input_audio_transcription.delta via TranscribeStream
then a completed event; otherwise it preserves the single completed-event
unary behaviour. Returns the final transcript for response generation.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-05 14:03:36 +00:00
Ettore Di Giacinto
378d6c25cf feat(realtime): route response audio through emitSpeech (streaming TTS)
Replace the inline unary TTS block in the response handler with emitSpeech,
which streams a response.output_audio.delta per backend PCM chunk when
pipeline.streaming.tts is set and otherwise preserves the single-delta unary
behaviour. emitSpeech returns the accumulated base64 audio, stored on the
conversation item as before. Transcript and audio-done events stay in the
handler so later per-segment streaming can reuse emitSpeech.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-05 14:03:36 +00:00
Ettore Di Giacinto
2c6fdd0570 feat(realtime): emitSpeech with flag-gated streaming TTS
emitSpeech synthesizes a piece of text and forwards audio to the client,
streaming one output_audio.delta per backend PCM chunk when the pipeline
sets streaming.tts, or one delta for the whole utterance otherwise. WebRTC
gets raw PCM (it resamples internally); WebSocket gets base64 PCM at the
session rate. It emits no transcript/audio-done events so a streamed reply
can be split into multiple spoken segments sharing one response.

Adds fakeModel/fakeTransport test doubles for the realtime Model/Transport
interfaces, driving streaming assertions deterministically.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-05 14:03:36 +00:00
Ettore Di Giacinto
2ba2216ce2 feat(realtime): streaming TTS/transcription methods on Model interface
Add TTSStream and TranscribeStream to the realtime Model interface and
implement them on wrappedModel (delegating to backend.ModelTTSStream /
ModelTranscriptionStream) and transcriptOnlyModel. ttsStream adapts the
backend's WAV-framed stream (44-byte header carrying the sample rate, then
PCM) into raw PCM + sample rate for the realtime transports. Handler wiring
that consumes these (flag-gated) follows.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-05 14:03:36 +00:00
Ettore Di Giacinto
e0820a11c9 feat(realtime): sentence segmenter for streamed LLM->TTS pipelining
streamSegmenter accumulates streamed LLM tokens and emits complete
sentence/clause segments (terminator+whitespace, or newline) so TTS can
synthesize each segment as it completes instead of waiting for the whole
reply. Pure helper; the streaming handler wiring consumes it next.

Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-05 14:03:36 +00:00
LocalAI [bot]
e837921c2c feat: forward reasoning_effort to the backend so jinja models honor it (#10184)
* feat: forward reasoning_effort to the backend so jinja models honor it

reasoning_effort was only mapped to the binary enable_thinking toggle and
otherwise reached Go-side templates — it was never sent to the backend. So
jinja-templated models whose chat template keys on reasoning_effort (gpt-oss
Harmony, LFM2.5) could not be driven by it: LFM2.5 ignores enable_thinking and
kept emitting <think>.

Forward the effective reasoning_effort to the backend as a chat_template_kwarg
(mirroring enable_thinking) in grpc-server.cpp, and put it in PredictOptions
metadata (gRPCPredictOpts). Add a config-level default: ModelConfig.reasoning_effort
and Pipeline.reasoning_effort, resolved by ModelConfig.ApplyReasoningEffort
(request value overrides config default, none->disable / level->enable, an
operator's reasoning.disable wins). request.go now uses that helper.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(realtime): set the pipeline LLM's reasoning_effort

Apply Pipeline.ReasoningEffort to the pipeline's LLM config when the realtime
model is built (per-session copy, overrides the LLM's own reasoning_effort),
and surface the resolved effort on the template input so Go-templated models
get it too. jinja models receive it via the backend metadata. This lets a
realtime pipeline disable thinking on models that only honor reasoning_effort
(e.g. LFM2.5), which enable_thinking can't.

Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-05 13:45:43 +00:00
LocalAI [bot]
27e63b9a78 feat(tts): support per-request instructions and params (#10172)
The OpenAI-compatible TTS endpoint accepts an `instructions` field, but it
was silently dropped at the HTTP->gRPC boundary: neither schema.TTSRequest
nor the gRPC TTSRequest proto carried it, so backends could only read such a
value from static YAML options (identical for every request). This blocked
per-line emotion/style and, for Qwen3-TTS VoiceDesign, limited a model config
to a single designed voice.

Plumb a generic per-request instruction string end to end, plus an optional
backend-specific params map:

- proto: add `optional string instructions` and `map<string,string> params`
  to TTSRequest.
- schema: add Instructions (maps OpenAI `instructions`) and Params (LocalAI
  extension) to schema.TTSRequest.
- core: thread both through ModelTTS/ModelTTSStream via a newTTSRequest helper
  that attaches instructions only when non-empty (so backends can fall back to
  YAML when unset); forward them from the /v1/audio/speech handler.
- qwen-tts: prefer the per-request instruction over the YAML `instruct` option
  (used by both mode detection and generation) and merge per-request params.
- chatterbox: merge per-request params (coerced to float/int/bool) over YAML
  options into generate() kwargs.

Fully backward compatible: empty instructions fall back to the YAML option and
backends that don't support style/voice instructions ignore the field.

Closes #10164


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-04 11:45:02 +02:00
Richard Palethorpe
3a932a9803 feat(distributed): Add NATS JWT authentication and TLS/mTLS options (#10159)
* feat(distributed): NATS JWT auth, TLS/mTLS options, and e2e coverage

Mint per-node NATS user JWTs at registration when LOCALAI_NATS_ACCOUNT_SEED
is set, and connect workers with scoped credentials from the register response.
Add optional LOCALAI_NATS_TLS_CA/CERT/KEY for private CA and mTLS alongside
tls:// URLs, plus test-e2e-distributed and NatsJWT container e2e specs.

Document JWT setup (nats-auth-setup.sh) and TLS env vars in distributed-mode.

Assisted-by: Grok:grok grok-build
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(distributed): correct NATS JWT scoping and harden client auth

The JWT-auth path added in 46467cc7 had several gaps that fail silently
under LOCALAI_NATS_REQUIRE_AUTH:

- Agent-worker minted JWTs did not allow the subjects the agent worker
  actually subscribes to (jobs.mcp-ci.new and nodes.<id>.backend.stop),
  so MCP-CI jobs and backend-stop session cleanup were silently dropped.
  Scope the agent permission set to those subjects.
- NATS subscription permission violations were swallowed (Subscribe
  returned a live-but-dead subscription). Confirm subscriptions with a
  server round-trip so a denial surfaces synchronously, and log async
  permission errors.
- The backend worker connected anonymously when given a JWT without its
  paired seed; reject the unpaired credential instead.
- The documented service-user permissions in nats-auth-setup.sh omitted
  prefixcache.>, which the frontend publishes and subscribes; add it.

Also: add a credential-provider hook to the messaging client (consumed by
the follow-up credential-lifecycle change), drop the always-nil error from
NatsMessagingOptions, run go mod tidy (jwt/v2 and nkeys are now direct),
and gofmt the feature's files.

Tests: an agent-JWT e2e spec that connects to the enforcing NATS server
and exercises every subscription the agent worker makes, plus permission
allow-list coverage unit tests.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(distributed): acquire and auto-refresh worker NATS credentials

Workers fetched NATS credentials once at startup, which broke two cases
under JWT auth: a worker that registered while still pending admin
approval never received a minted JWT (it connected unauthenticated and
gave up), and a long-running worker's 24h JWT expired with no way to renew
it.

Introduce workerregistry.NATSCredentialManager, built on idempotent
re-registration (the frontend preserves the node row and mints a fresh JWT
each call):

- Acquire re-registers through admin approval until the node is approved
  and credentials are minted (or returns the first success when auth is
  not required, preserving anonymous-NATS behavior).
- RefreshLoop re-registers before the JWT expires (~75% of its lifetime),
  updating the credentials served to the connection.
- Both are bounded (default 100 attempts / consecutive failures) and
  return an error on exhaustion, so an unapprovable or unrenewable worker
  exits non-zero and surfaces the problem instead of hanging or drifting
  toward an expired credential.

The messaging client gains WithUserJWTProvider, fetching credentials on
each (re)connect so the connection transparently adopts a refreshed JWT
when the server expires the old one. RegisterFull exposes the approval
status and full response; Register delegates to it.

Both the backend worker and the agent worker are wired to this: explicit
env credentials are used as-is, minted credentials are acquired-with-wait
and refreshed, and a permanent refresh failure shuts the worker down so it
restarts and re-acquires.

Tests cover Acquire (wait-through-pending, bounded give-up, context
cancel), RefreshLoop (refresh-before-expiry, bounded failure, no-expiry
exit) and jwtExpiry decoding. Docs updated in distributed-mode.md.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-06-03 19:43:56 +02:00
Richard Palethorpe
5a0013defe test(react-ui): add page render-smoke specs, reset the coverage gate (#10122)
The UI coverage gate was tightened to 0.1pp against a fast-local
measurement (39.86% baseline); CI's slower runners measure ~0.9pp lower,
so tests-ui-e2e failed there. UI e2e coverage is diffusely
non-deterministic and tracks machine speed — a 0.1pp band can't hold
across environments.

Rather than loosen the gate, raise the floor under it: a render-smoke
spec mounts each lazy page (navigate + assert the header renders),
covering a dozen previously-untested pages and lifting coverage from
~39% to ~42.7% locally. Restore the tolerance to 0.8pp and set the
baseline conservatively (40.0), below the slow-CI floor, so the ratchet
holds without flapping.

Document the coverage policy — install the git hooks and don't bypass
them (no --no-verify, no hand-lowering the baseline or widening the
tolerance); raise coverage by adding tests instead; set the UI baseline
below the slow-CI floor — in AGENTS.md, CONTRIBUTING.md and
.agents/building-and-testing.md.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-06-01 14:24:36 +02:00
Richard Palethorpe
718223f33b feat(localvqe/audio): v1.3 release and add spectrograms to audio transform UI (#10113)
* chore(localvqe): update backend to v1.3, add v1.2/v1.3 gallery models

Bump the LocalVQE backend pin 72bfb4c6 -> b0f0378a, which adds the v1.2
(1.3 M) and v1.3 (4.8 M) GGUF SHA-256s to the upstream released-models
allowlist (and the arch_version=3 loader) so both load without
LOCALVQE_ALLOW_UNHASHED.

Add gallery entries for localvqe-v1.2-1.3m and localvqe-v1.3-4.8m
(SHA-256 verified against the downloaded weights) and update the
audio-transform docs to make v1.3 the current default while noting the
compact v1.1/v1.2 alternatives.

Assisted-by: Claude:claude-opus-4-8 Claude-Code
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* chore(flake): add ffmpeg-headless to the dev shell

pkg/utils/ffmpeg_test.go shells out to the `ffmpeg` CLI, and the
pre-commit gate runs those tests via `make test-coverage`. Without
ffmpeg in the dev shell the gate fails with "executable file not found
in $PATH". The headless build provides the CLI without GUI/X deps.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(localvqe): parse WAV by walking RIFF sub-chunks

Walk the RIFF chunk list instead of assuming the canonical 44-byte
header layout. Real inputs (browser-recorded clips, ffmpeg output with
an 18/40-byte extensible `fmt ` chunk or trailing LIST/INFO metadata)
would otherwise splice header/metadata bytes into the PCM stream as an
audible impulse. Honour the `data` chunk size and validate that both
`fmt ` and `data` chunks are present.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(security-headers): allow blob: in connect-src for waveform fetch

The waveform renderer XHRs/fetches a freshly-created blob: object URL
(e.g. an uploaded or enhanced clip before it has a server URL). XHR/fetch
of blob: is governed by connect-src, not media-src, so it was blocked by
the CSP. Add blob: to connect-src.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(react-ui): add input/output spectrogram view to AudioTransform

The transform page only showed time-domain amplitude waveforms, so you
could see how loud a clip was but not which frequencies the model
touched. Add a time x frequency spectrogram heatmap and render the input
and output spectrums side by side, so it's visible which bands the
enhancement attenuates (bright input bands that go dark in the output).

Computed client-side via a Hann-windowed STFT over both clips (a small
dependency-free radix-2 FFT), defaulting to the LocalVQE 512/256 frame
geometry. This shows the net input->output spectral change; the model's
internal gain mask is not exposed by the backend.

- src/utils/fft.js            radix-2 FFT
- src/hooks/useSpectrogram.js decode + STFT -> normalised dB magnitude grid
- src/components/audio/Spectrogram.jsx  canvas heatmap (magma colormap)
- AudioTransform.jsx          dual-spectrogram panel + CSS
- e2e spec + UI coverage baseline bump (38.29 -> 39.0; measured ~39.4-40.2)

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

* test(react-ui): make UI coverage deterministic, tighten the gate

UI e2e line coverage swung ~1pp run-to-run (39.1% <-> 40.2%), which forced
a loose 0.8pp tolerance on the monotonic gate — a band wide enough to let
a real ~300-line regression through silently. The swing was a bug, not
inherent jitter: the 'Create Agent navigates' spec ended on the URL
assertion, so AgentCreate.jsx's ~400 lines were collected only when its
render happened to beat the coverage teardown.

Wait for the page to actually render (assert its heading) so those lines
are covered every run. With the race gone, repeated runs land within
~0.013pp of each other, so:

- tighten UI_COVERAGE_TOLERANCE 0.8 -> 0.1 (noise floor, not a drift band)
- set the baseline to the real, reliably-achieved value (39.0 -> 39.86)

Localised by running the V8-coverage suite repeatedly and diffing per-file
line coverage; AgentCreate.jsx was the sole ~1pp flipper.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-31 23:56:46 +02:00
LocalAI [bot]
76fe0bb929 feat(crispasr): add CrispASR backend — multi-architecture ASR + TTS (#10099)
* feat(crispasr): backend source files (Go gRPC server, C-ABI shim, build files)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* polish(crispasr): brand error strings + fix stale shim comment

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* build(crispasr): register backend in root Makefile

Mirror the whisper Go backend registration for the new crispasr
backend: NOTPARALLEL entry, prepare-test-extra/test-extra hooks,
BACKEND_CRISPASR definition, docker-build target generation, and the
docker-build-backends aggregate target.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(crispasr): add backend build matrix entries

Mirror the 11 whisper golang Dockerfile matrix entries (CPU amd64/arm64,
CUDA 12/13, L4T CUDA 13, Intel SYCL f32/f16, Vulkan amd64/arm64, L4T
arm64, ROCm hipblas) with backend and tag-suffix substituted to crispasr.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add crispasr backend gallery entries

Add the crispasr meta anchor and its full set of image gallery entries
(cpu, metal, cuda12/13, rocm, intel-sycl f32/f16, vulkan, L4T arm64,
L4T cuda13 arm64, plus -development variants), mirroring the whisper
backend gallery block.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(crispasr): bump CRISPASR_VERSION via bump_deps workflow

Track CrispStrobe/CrispASR main branch and bump CRISPASR_VERSION in
backend/go/crispasr/Makefile.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* build(crispasr): don't wire fixture-gated test into test-extra

Mirror the whisper Go backend: its AudioTranscription test is gated on
model/audio fixtures and skips in CI, so building crispasr (the heaviest
ggml compile in the tree) inside the unit-test lane adds a long compile
for zero coverage. The backend image build in backend-matrix.yml remains
the authoritative compile check.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(crispasr): add darwin metal build entry (mirror whisper)

The metal-crispasr gallery entries and capabilities.metal mapping
reference -metal-darwin-arm64-crispasr, which is only produced by an
includeDarwin entry. Mirror whisper's darwin metal entry so the tag
actually gets built.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* ci(crispasr): place hipblas matrix entry next to whisper twin

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): register crispasr as pref-only ASR backend + test

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(crispasr): port whisper behavioral suite (cancellation + streaming)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(crispasr): fix skip message env var names to CRISPASR_*

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): switch shim to crispasr_session_* multi-architecture API

The shim used whisper_full(), which in CrispASR is the whisper-only path:
libcrispasr only transcribes Whisper GGUFs through it. Multi-architecture
transcription (Parakeet, Voxtral, Qwen3-ASR, Canary, Granite, FunASR,
Paraformer, SenseVoice, ...) goes through the crispasr_session_* C-ABI,
which auto-detects the architecture from the GGUF and dispatches to the
matching backend.

Rewrite the C shim around crispasr_session_open / _transcribe_lang /
_result_* and add get_backend() so the selected backend is logged.
load_model now takes a threads param (session_open binds n_threads at
open). The session result is segment+word based with no token IDs and no
per-decode callback, so drop n_tokens / get_token_id /
get_segment_speaker_turn_next / set_new_segment_callback. set_abort is
kept for API parity but is best-effort: the session transcribe is blocking
with no abort hook.

Update the purego bindings and gocrispasr.go to match: tokens are left
empty, speaker-turn handling is removed, and AudioTranscriptionStream
emits one delta per non-empty segment after the blocking decode returns
(no progressive streaming via the session API), preserving the
concat(deltas) == final.Text invariant.

crispasr_session_set_translate is exported by libcrispasr but not declared
in crispasr.h, so it is forward-declared in the shim alongside the
open/transcribe/result functions.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* build(crispasr): link full CrispASR backend set for multi-arch support

The shim's crispasr_session_* dispatch calls into the per-architecture
backend libs (parakeet, voxtral, qwen3_asr, canary, funasr, paraformer,
sensevoice, ...), which CrispASR builds as static archives. Linking only
crispasr + ggml dead-stripped every backend object from the final module
(nm backend-symbol count: 0), leaving a whisper-only .so.

Link the same backend set as crispasr-cli so the static archives are
pulled in. After this the module carries the backend symbols (nm count
407, .so grows from ~2.1MB to ~6.7MB) and the session API can dispatch to
every compiled-in architecture.

Also rewrite ${CMAKE_SOURCE_DIR}/examples/talk-llama to
${PROJECT_SOURCE_DIR}/... in the vendored src/CMakeLists.txt: CrispASR
locates its vendored llama.cpp via ${CMAKE_SOURCE_DIR}, which is wrong when
CrispASR is add_subdirectory'd (CMAKE_SOURCE_DIR points at this backend
dir, not the CrispASR root). PROJECT_SOURCE_DIR is correct both standalone
and as a subproject; the sed is idempotent.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(crispasr): adapt suite to session API (blocking, no decode callback)

Register the new symbol set (drop the removed token/speaker/callback funcs,
add get_backend; load_model now takes 2 args). The session transcribe is
blocking with no abort hook, so a mid-decode cancel can't interrupt it:
change the cancellation spec to cancel the context before the call and
assert codes.Canceled from the pre-call ctx.Err() check, dropping the
<5s mid-decode timing assertion. The streaming spec still holds with
per-segment post-decode emission (>=2 deltas, concat(deltas) == final.Text).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add CrispASR ASR model entries (-crispasr)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(gallery): keep only session-auto-detectable CrispASR ASR models

The crispasr backend loads models via crispasr_session_open, which
auto-detects the backend from the GGUF general.architecture using
crispasr_detect_backend_from_gguf. Architectures not in that detect
map cannot be opened, so those gallery entries fail to load.

Removed entries whose architecture is not wired into CrispASR
v0.6.11's session auto-detect router (they can be re-added when
upstream maps them):

- Not in the detect map: data2vec, firered-asr, funasr,
  fun-asr-mlt-nano, glm-asr, hubert, kyutai-stt, mega-asr, mimo-asr,
  moonshine{,-de,-streaming,-tiny-de}, omniasr{,-llm,-llm-1b},
  paraformer, sensevoice.
- Pending verification (filename-heuristic routed, not arch-detected):
  parakeet-ctc-0.6b, parakeet-ctc-1.1b. Their GGUFs are routed to the
  fastconformer-ctc backend by a filename heuristic in the model
  registry, which implies general.architecture is not a mapped string.

Kept the parakeet rnnt/tdt_ctc variants: convert-parakeet-to-gguf.py
writes general.architecture="parakeet" unconditionally and encodes the
rnnt/ctc distinction in metadata fields, so they session-auto-detect.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): TTS synthesis via crispasr_session_synthesize (24kHz)

Add tts_synthesize/tts_free/tts_set_voice to the C-ABI shim. They reuse
the already-open g_session (crispasr_session_open auto-detects a TTS
model) and dispatch to the upstream synthesis call, which returns
malloc'd 24 kHz mono float PCM. Orpheus needs a SNAC codec path that we
do not set, so it returns NULL here and surfaces as an error Go-side.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): implement TTS/TTSStream gRPC methods

Bind the new shim functions via purego and implement TTS, TTSStream and
a writeWAV24k helper. synthesize copies the C-owned PCM out before
freeing it; TTS writes a 24 kHz mono 16-bit WAV to req.Dst via
go-audio/wav. CrispASR has no progressive synth, so TTSStream
synthesizes fully, encodes to WAV, and emits the bytes as a single
chunk; it owns the results-channel close (the gRPC server wrapper ranges
until close), mirroring vibevoice-cpp's TTSStream.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): log when a TTS voice override is not honored

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): add CrispASR vibevoice-tts model entry

Only vibevoice-tts works through the current shim: qwen3-tts, chatterbox,
and orpheus require companion codec/s3gen/SNAC paths (set_codec_path /
set_s3gen_path) that the shim doesn't wire yet, and kokoro/indextts/voxcpm2
aren't in the session auto-detect map. Those are follow-ups.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(crispasr): gated TTS synthesis spec

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(crispasr): satisfy golangci-lint (errcheck defers + unsafeptr nolint)

The crispasr Go file is entirely new, so new-from-merge-base lints every
line (unlike the grandfathered whisper backend it was forked from):
- handle os.RemoveAll / fh.Close return values in AudioTranscription
- annotate the two intentional C-pointer unsafe.Slice sites with //nolint:govet

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): backend: and codec: model options (explicit arch + companion files)

Add two model-config options to the CrispASR backend via opts.Options:

- backend:<name> selects an explicit CrispASR backend (bypassing
  auto-detect) by routing load_model through
  crispasr_session_open_explicit, unlocking architectures the
  detector won't pick on its own (qwen3, cohere, granite, voxtral,
  moonshine, mimo-asr, orpheus, kokoro, chatterbox, etc.).
- codec:<path> loads a companion file (qwen3-tts codec, orpheus SNAC,
  chatterbox s3gen, or mimo-asr tokenizer) via the universal
  crispasr_session_set_codec_path setter after the session opens. A
  relative path resolves against the model directory. rc==0 means
  success or not-applicable; only a negative rc is fatal.

The C shim load_model gains a backend_name argument and a new
set_codec_path entry point; the Go bridge parses the prefix:value
options and registers the new symbol. The vad_only path is unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): expand CrispASR models via backend:/codec: options (explicit arch + companions)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(gallery): use virtual.yaml base for crispasr models

The crispasr entries are just backend + model + a couple options, fully
expressed inline via overrides:/files: in gallery/index.yaml. Point each
url: at the shared gallery/virtual.yaml (the established 'virtual' model
trick) and drop the 36 redundant per-model gallery/*-crispasr.yaml files.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(gallery): drop voice-requiring TTS entries (keep vibevoice-tts)

Real e2e showed qwen3-tts/orpheus/chatterbox don't synthesize through the
current shim: the codec: companion loads fine, but these engines additionally
need a voice pack / voice prompt / reference clip (qwen3-tts base errors
'no voice'; chatterbox is zero-shot cloning; orpheus uses named voices) that
the backend doesn't wire. (qwen3-tts also can't auto-detect: its GGUF arch is
'qwen3tts', unmapped by the detector — would need backend:qwen3-tts.) Removed
to avoid shipping non-working gallery entries; vibevoice-tts (built-in voice,
e2e-verified) remains the working TTS. Voice-pack wiring is a follow-up.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(crispasr): speaker: and voice: TTS options (baked speakers + voice packs/prompts)

speaker:<name> -> crispasr_session_set_speaker_name (baked speakers: qwen3-tts
CustomVoice, orpheus). voice:<path>(+voice_text:<ref>) -> crispasr_session_set_voice
(voice-pack GGUF, or WAV zero-shot clone with ref text). Applied at Load as the
default voice; req.Voice still overrides the speaker per request.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(gallery): re-add e2e-verified TTS engines (chatterbox, qwen3-tts-customvoice, orpheus)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-31 12:11:03 +02:00
LocalAI [bot]
a44bdb29d4 feat: prefix-cache-aware routing for distributed mode (#10071)
* feat(radixtree): generic prefix tree skeleton with longest-match

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(radixtree): Insert with path recency refresh and entry cap

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(radixtree): TTL idle-expiry and Evict sweep with branch pruning

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(radixtree): recency-weighted per-value Weight

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(radixtree): Remove all entries for a value

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(radixtree): race-free concurrency smoke test

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(radixtree): reclaim empty branches, RWMutex reads, TTL boundary, empty-key guard

Address review findings on the generic prefix tree:

- Extract a shared pruneWalk helper parameterized by a shouldClear
  predicate and use it from Evict, Remove, and the MaxEntries path.
  Previously evictOldestLocked cleared a victim's value but never
  removed the now value-less node or its childless ancestors, so
  internal nodes accumulated under sustained churn at the cap. The
  MaxEntries path now prunes the victim and its empty ancestors.
- DRY: pruneWalk replaces the duplicated logic in the former
  pruneLocked and Remove's inner closure.
- Switch Tree.mu to sync.RWMutex; LongestMatch, Weight and Len take
  the read lock (RLock) while Insert, Evict and Remove keep the write
  lock. Confirmed race-clean under go test -race.
- Document the strict greater-than TTL boundary on Options.TTL and
  expired: age exactly equal to TTL is still live.
- Guard Insert against an empty key (no-op): the root never holds a
  value.

Adds Ginkgo specs covering MaxEntries eviction, ancestor reclamation,
the no-growth-past-cap invariant, the TTL boundary, and empty-key
behavior for both Insert and LongestMatch.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): RoutePolicy enum with parse/resolve

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): Config with defaults and validation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): deterministic xxhash prefix-chain extractor

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): pure filter-then-score replica selection

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): Provider interface and radix-tree-backed Index

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* style(prefixcache): gofmt policy enum comment alignment

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): head-first prefix chunking and hoist Weight out of sort

Address code-quality review findings in the prefixcache package.

Correctness: ExtractChain now chunks from absolute offset 0 with fixed
[0,W),[W,2W),... boundaries and caps the chain to the FIRST MaxDepth
head blocks. The previous tail-keeping logic shifted the byte offset by a
non-window amount once a conversation grew past MaxDepth*WindowBytes,
changing every hash each turn and silently breaking cross-turn
longest-prefix matching. The reusable KV/prefix cache lives at the head
of the prompt, so anchoring at offset 0 makes the chain a true
prefix-chain: P and P+suffix share their full leading overlap. Add a
regression spec proving cross-turn stability past the cap.

Performance: Index.Decide precomputes each candidate's Weight once
(decorate-sort-undecorate) instead of calling the O(tree size) Weight
inside the O(n log n) sort comparator. Behavior is unchanged.

Lint: encode prev with binary.LittleEndian.PutUint64 instead of a manual
byte loop, clearing the modernize rangeint finding.

Also add a concurrent Decide/Observe/Invalidate spec to exercise Index's
documented concurrency safety under go test -race.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(messaging): prefixcache observe/invalidate subjects and payloads

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): NATS sync publish/apply for observe and invalidate

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributedhdr): ctx carrier for prefix-hash chain

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributedhdr): PrefixChainHook indirection for backend-side chain build

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(backend): stash prompt prefix chain on ctx before distributed routing

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(backend): mirror modelID fallback for prefix-chain salt parity

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): scheduling config columns for prefix-cache routing

Add RoutePolicy and per-model balance/prefix-match override columns to
ModelSchedulingConfig and include them in the SetModelScheduling upsert
DoUpdates list so updates are not dropped on conflict.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): optional route preference in FindAndLockNodeWithModel

Add a RoutePreference type and a new pref parameter so the atomic
pick+lock+increment can be biased toward a preferred node without
weakening atomicity. A nil preference reproduces the previous ORDER BY
behavior exactly. Update the ModelRouter interface, both router.go call
sites (pass nil for now; Phase 5 builds the real preference), the test
doubles, and the distributed e2e caller.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): make Sync satisfy Provider with Evict

Sync.Observe now returns whether the local index treated the assignment as
new or extended, and Sync gains an Evict method that delegates to the wrapped
index. Together these let SmartRouter hold a single prefixcache.Provider that
broadcasts via NATS. Adds a compile-time Provider assertion and an
Evict-delegates behavioral test.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): prefix-cache-aware preference and observe in SmartRouter.Route

Add a PrefixProvider + PrefixConfig to SmartRouterOptions/SmartRouter (nil
keeps routing byte-for-byte the round-robin floor). On each request Route now
calls buildPreference: it reads the prompt prefix chain from ctx
(distributedhdr.PrefixChain), resolves the per-model policy/thresholds over
the global config, loads candidate replica in-flight via a new registry read
LoadedReplicaStats (deduped to one entry per node using the MIN in-flight
across that node's replicas), asks the provider to Decide, and runs
prefixcache.Select. The chosen node is passed as the RoutePreference to
FindAndLockNodeWithModel on all three pick paths (cache hit, locked re-pick,
cold scheduleAndLoad), and the served node is recorded via Observe only when
the resolved policy is prefix_cache so round-robin models never pollute the
tree.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): invalidate prefix-cache entries on unload and stale removal

UnloadModel and both staleness fall-through paths in Route (after a failed
gRPC probe and RemoveNodeModel) now call prefixProvider.Invalidate(model,
nodeID), guarded by a nil-provider check so the round-robin floor is
unchanged. At runtime the provider is the *prefixcache.Sync, so invalidations
also broadcast to peer frontends. Adds a test that a previously hot prefix no
longer Decides to a node after UnloadModel.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(prefixcache): rolling forced-disturb pressure counter

Add a concurrency-safe per-model rolling counter that tracks how many
times a request had a usable hot prefix match but the load guard forced
it off the warm node. Entries outside the window are dropped lazily on
Count so the backing slice stays bounded.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): autoscale on prefix-cache forced-disturb pressure

Wire the rolling forced-disturb counter into the SmartRouter and the
ReplicaReconciler.

Router: in buildPreference, after Decide + Select, record a forced-disturb
when a usable hot prefix match existed (d.HotNodeID != "" and
d.MatchRatio >= cfg.MinPrefixMatch) but Select chose a different node (or
nothing) because the load guard ruled the warm node out. This is the
scale-worthy signal: the cache-warm replica is saturated. It deliberately
does not fire for all-unique workloads (no hot match), avoiding
false-positive scale-ups. Pressure is optional on SmartRouterOptions; nil
keeps the path a no-op.

Reconciler: read the same Pressure instance in reconcileModel as an extra
scale-up reason, reusing the existing MaxReplicas + ClusterCapacityForModel
guards and the UnsatisfiableUntil cooldown that gates the whole method.
Pressure never overrides MaxReplicas and never force-evicts; a no-capacity
model does not spin. Window and threshold come from prefixcache.Config
(PressureWindow default 1m, PressureScaleThreshold default 1) and are
configurable via ReplicaReconcilerOptions.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): bound Pressure slice in Record; drop dead reconciler pressureWindow

Record now prunes entries older than the rolling window (the same prune
Count does), via a shared pruneLocked helper, so a model that takes
forced-disturb records but is never Counted (e.g. one with zero loaded
replicas the reconciler skips) no longer grows its backing slice
unbounded.

Also removes the dead pressureWindow struct field and the
ReplicaReconcilerOptions.PressureWindow option from the reconciler: they
were stored but never read (the window lives inside the *prefixcache.Pressure
instance). The scale block now reads pressure.Count once into a local.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(api): prefix-cache fields in scheduling endpoint DTO with validation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): prefix-cache routing controls in node scheduling form

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): wire prefix-cache index, NATS sync, and config

Activates prefix-cache-aware routing in distributed mode. Builds the
prefixcache Index + NATS-backed Sync + Pressure counter, installs the
distributedhdr.PrefixChainHook so core/backend/llm.go attaches a prefix
chain per request, subscribes to prefixcache.observe/prefixcache.invalidate
to apply peers' events to the local index (no re-broadcast), threads
PrefixProvider/PrefixConfig/Pressure into the SmartRouter and
Pressure/PressureThreshold into the ReplicaReconciler, and runs a
background eviction ticker (every TTL/2) bound to the app context.

Enabled by default; --distributed-prefix-cache=false (LOCALAI_DISTRIBUTED_PREFIX_CACHE)
opts out and leaves the provider/pressure nil so routing stays round-robin.
--distributed-prefix-cache-ttl (LOCALAI_DISTRIBUTED_PREFIX_CACHE_TTL, default 5m)
controls entry idle-timeout and eviction cadence.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(nodes): round-robin-floor invariant for prefix-cache routing

Drives Select directly: a saturated hot node (in_flight 50 vs 0) is never
picked even with a perfect prefix match (round-robin floor holds), while a
balanced hot node within the load slack is reused.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(prefixcache): clear branch lint findings and em dashes

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): validate prefix-cache config at startup wiring

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* perf(radixtree): single-walk WeightsFor for batch value weights

Add Tree.WeightsFor(values, now) which computes the recency-weighted
weight for many values in a single O(N + len(values)) tree traversal,
versus calling Weight once per value (O(len(values) * N)). Consumers
that score K candidates against the tree under the read lock no longer
pay K full walks.

Extract the per-entry contribution math into an unexported helper shared
by both Weight and WeightsFor so the metric stays identical (DRY).
Weight's public behavior is unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(config): add ModelConfig.ModelID() single source of truth

The c.Name fallback to c.Model was duplicated in core/backend/options.go
(feeding model.WithModelID) and hand-copied into core/backend/llm.go (the
prefix-chain salt). These MUST agree or the prefix-cache salt diverges
silently from the id the model loader tracks. Consolidate both into a new
config.ModelConfig.ModelID() helper and call it from both sites. Behavior
is identical.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* perf(prefixcache): reuse one xxhash.Digest in ExtractChain

ExtractChain allocated a fresh xxhash.New() Digest per block (up to MaxDepth
per call) and grew the chain slice without preallocation. Reuse a single
Digest via Reset() before each block and preallocate the chain to
min(nBlocks, MaxDepth).

xxhash seed 0 is stateless, so Reset()+Write produces the byte-identical
value to a fresh New()+Write. Output hashes are unchanged, preserving the
cross-process determinism that peers rely on over NATS. Verified by capturing
ExtractChain output for the existing test inputs before and after the
refactor: identical. Existing extractor tests pass unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): drop hot match when matched node is not a candidate; weigh cold candidates in one walk

Index.Decide called radixtree.LongestMatch over the whole tree, so the
deepest match could be a node that is offline, unloaded, or simply not in
the passed candidate set. Honoring that as HotNodeID produced a false
forced-disturb signal upstream (buildPreference records pressure when
chosen != HotNodeID), making it look like a warm replica was load
saturated when it was actually absent.

Build the candidate set once and only set HotNodeID/MatchRatio when the
matched node is an actual candidate; otherwise fall back to cold
placement. A future refinement could ask the tree for the longest match
restricted to the candidate nodes (shallower-but-valid) instead of
dropping it.

Also replace the per-candidate tree.Weight call in the cold-order sort
with a single tree.WeightsFor walk, turning O(K*N) under the read lock
into O(N + K).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(prefixcache): remove Select's unreachable deterministic fallback

buildPreference always passes ColdOrder as a permutation of the full
candidate set, so the cold-order loop hits every eligible candidate. The
trailing best/bestIF scan was dead. Replace it with a plain "return """
and document that ColdOrder is guaranteed to cover all candidates, so ""
means none were eligible.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(nodes): fetch model scheduling config once per Route

GetModelScheduling was read three times per request - in
resolveSelectorCandidates, buildPreference, and nodeMatchesScheduling -
three DB round-trips for one row that is immutable for the life of the
request, and not a consistent snapshot. Fetch it once near the top of
Route and thread the *ModelSchedulingConfig (may be nil) into all three
helpers. scheduleNewModel keeps its own fetch since it runs outside the
Route snapshot. Behavior is identical for nil sched.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(autoscale): add Pressure.Reset to consume forced-disturb signal

Pressure.Count is non-draining (it prunes only by age), so a single burst
of forced-disturbs stays within the rolling window for the whole window and
keeps Count >= threshold on every reconciler tick. The reconciler will use
Reset to clear a model's events after acting on the signal so a fresh
scale-up requires fresh forced-disturbs to accumulate, rather than one burst
driving the model toward MaxReplicas.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(autoscale): at most one scale-up per reconcile tick, consume pressure

Two autoscale bugs:

1. Over-scaling: the pressure scale-up block read Pressure.Count but never
   consumed it. With a non-draining counter a single forced-disturb burst
   kept Count >= threshold across the whole window, firing scaleUp on every
   tick and pushing the model toward MaxReplicas off one transient burst.
   After a successful pressure-triggered scale-up the reconciler now calls
   Pressure.Reset to consume the signal.

2. Double scale-up in one tick: the all-replicas-busy block and the pressure
   block could both fire in the same reconcileModel pass, each calling
   scaleUp(+1) against the same `current` read once at the top, so a model
   that was both busy and over threshold scaled +2 and could overshoot
   MaxReplicas by one. A scaledUp flag now enforces at most one scaleUp(+1)
   per tick: the pressure block is skipped if the busy block already scaled,
   and scale-down is skipped in any tick that scaled up.

MinReplicas enforcement, UnsatisfiableUntil backoff, and capacity guards are
unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): replica-removed chokepoint hook for prefix-cache invalidation

Add SetReplicaRemovedHook to NodeRegistry and fire it from both
RemoveNodeModel and RemoveAllNodeModelReplicas after a successful
delete. This is the single chokepoint every replica-removal path funnels
through (router eviction, reconciler scale-down, probe reaper,
health-monitor node-down reap, RemoteUnloaderAdapter), so the
prefix-cache index can be invalidated by construction rather than wiring
each call site individually.

The hook is stored in an atomic.Pointer so the startup wiring (setter)
and the request/reconcile-time fire are race-free; it is nil-safe when
unset. GORM Delete reports no error for a no-op delete, so the hook also
fires when nothing was removed; the consumer's Invalidate(model, node)
is idempotent so this is harmless.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): invalidate prefix-cache on any replica removal via registry hook

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(prefixcache): single source of truth for threshold bounds

Extract ValidateThresholds into prefixcache/config.go so the per-model
override validation (nodes.go endpoint) and Config.Validate share one
implementation of the numeric bounds (min_prefix_match in [0,1],
balance_abs_threshold >= 0, balance_rel_threshold == 0-or->= 1) instead
of hard-coding them in two places. The route_policy allow-list stays
explicit (not ParsePolicy, which maps typos to Default).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(nodes): preserve prefix-cache settings on partial scheduling update

A scheduling POST that omitted route_policy/thresholds (e.g. a
min_replicas-only update) full-replaced every column and silently reset
the model's previously-configured prefix-cache settings to empty/zero.

Make the four prefix-cache request fields pointers so omitted is
distinguishable from explicit zero, and merge PATCH-style in
SetSchedulingEndpoint: a provided pointer wins, an omitted one preserves
the existing config value (zero default when none). Non-prefix fields
keep their full-replace PUT semantics. Validation now runs on the
resolved values via prefixcache.ValidateThresholds.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): make Invalidate a no-op for uncached models and skip empty broadcasts

A registry chokepoint fires Sync.Invalidate(model, nodeID) for every replica
removal of every model, including round-robin models that never used the
prefix cache. Index.Invalidate previously called tree(model), which lazily
created and permanently retained an empty radix tree for any model that ever
lost a replica, growing the trees map without bound. Sync.Invalidate also
published a NATS PrefixCacheInvalidateEvent on every call, amplifying no-op
removals across the cluster.

Index.Invalidate now looks the tree up read-only via existingTree and returns
without allocating when none exists. The Provider interface is unchanged;
Sync gates the broadcast through an optional invalidateExisting(bool) capability
type-asserted from the wrapped Index, falling back to the prior always-broadcast
behavior for other Provider implementations.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* perf(prefixcache): derive Decide candidacy from WeightsFor and skip trivial sort

WeightsFor already returns a map keyed by every requested candidate, so the
separate candidates set built to validate the hot match was redundant: a node
is a candidate iff it is a key in the weights map. Drop the extra map and gate
the hot-match check on weights membership. Also skip the sort when there is at
most one candidate, since the input order is already the cold order. Behavior
is unchanged.

Deferred follow-up: skipping the WeightsFor walk entirely when a hot match wins
would need lazy cross-file changes and is out of scope here.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(nodes): fire replica-removed hook on bulk node_models deletes; trim LoadedReplicaStats columns

Bulk node-scoped node_models deletes (Register re-register cleanup,
MarkOffline, MarkDraining, Deregister) removed rows directly without
firing the replica-removed hook, so the prefix-cache index kept
pointing at nodes whose models were gone. Capture the DISTINCT model
names before each bulk delete and fire fireReplicaRemoved once per
model after a successful delete, restoring the single-chokepoint
invariant for all removal paths. The pre-query is skipped when no hook
is set so the no-hook path stays cheap.

Also narrow LoadedReplicaStats to SELECT only node_id and in_flight
(the only fields the router consumer reads), dropping the JOIN-side
available_vram fetch and unused columns while keeping the
[]ReplicaCandidate return type unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(reconciler): consume autoscale signals only on a real scale-up

scaleUp was fire-and-forget (void) yet its callers unconditionally
consumed the pressure signal (Pressure.Reset) and the MinReplicas
hysteresis (ClearUnsatisfiable) right after calling it. If scaleUp
added nothing (ScheduleAndLoadModel errored, or no node could be
loaded) the saturated warm replica got no new replica AND its
accumulated forced-disturb history was wiped, forcing the signal to
re-accumulate over a full PressureWindow before the next attempt.

Make scaleUp return whether at least one replica was actually
scheduled, and gate the side effects on it:

- pressure block (2b): set scaledUp and call Pressure.Reset only on
  success; on failure preserve the signal so the next tick retries off
  the same accumulated pressure.
- busy-burst block (2): set scaledUp from the return value so a failed
  attempt does not suppress the pressure path or scale-down.
- MinReplicas block: call ClearUnsatisfiable only on success so a
  failed attempt does not reset the unsatisfiable counter.

All existing invariants (MaxReplicas, capacity gating,
UnsatisfiableUntil cooldown, at-most-one-scale-up-per-tick) are
preserved.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(nodes): drop router's redundant prefix-cache Invalidate calls

The NodeRegistry removal chokepoint (RemoveNodeModel /
RemoveAllNodeModelReplicas) now fires SetReplicaRemovedHook, which
invalidates the prefix-cache index. The router was also calling
prefixProvider.Invalidate explicitly right after each registry removal
on the two stale-replica health-probe fall-throughs in Route and in
UnloadModel, so every router-side eviction invalidated twice (double
tree-prune + double NATS broadcast).

Remove the three redundant explicit Invalidate calls and their empty
nil-guards. Each removed call sat immediately after a registry removal
that fires the hook, so invalidation is preserved via the chokepoint.
Decide/Observe usage is untouched.

Re-point the unit test (fake registry fires no hook) to assert the
removal chokepoint is exercised on unload instead of the router's
direct invalidation.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): broadcast invalidations unconditionally for cross-frontend coherence

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(prefixcache): reject TTL<=0 in Config.Validate (eviction ticker would panic)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(nodes): make capture+delete atomic in bulk node_models removal paths

MarkOffline, MarkDraining, and the Register re-register cleanup ran the
nodeModelNames SELECT and the bulk node_models DELETE as two separate
statements on r.db with no transaction. A SetNodeModel landing between
the two was deleted but its replica-removed hook never fired, leaving
the prefix-cache index pointing at a removed replica until TTL or
candidacy self-heal.

Wrap the capture and the delete in a single db.Transaction in each path
(mirroring how Deregister already does it). The captured model names are
collected into a slice declared outside the closure; the
replica-removed hook fires for each only after the transaction commits,
so a rollback never invalidates the index for a removal that did not
persist. The set of fired hooks now equals exactly the set of
node_models rows actually deleted, with no interleaving gap.

The status flip in MarkOffline/MarkDraining (setStatus) is a separate,
pre-existing operation and routing already filters non-healthy nodes, so
it stays outside the transaction; return contracts are unchanged.
Deregister was already correct and is untouched. The cheap-path skip
(no hook -> skip the SELECT) is preserved.

Adds a spec asserting MarkOffline fires hooks for exactly the rows it
deletes and leaves no node_models row behind (consistent snapshot).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(nodes): debug logging for prefix-cache routing decisions and observations

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(radixtree): match shared prefixes by valuing every node on insert

Insert recorded the value (node id) only on the final node of the key
chain, leaving every intermediate prefix node valueless. LongestMatch
returns the deepest node that hasValue, so two chains that share a
leading block but diverge in the tail never matched: only exact-repeat
queries hit. That broke the prefix-cache routing core use cases (shared
system prompt, multi-turn extension, volatile tail), all of which rely
on prefix matching rather than exact-repeat.

Set value/hasValue/lastSeen at every node along the chain so each
prefix-block node remembers the node id that served that prefix
(SGLang/vLLM-style). The deepest match wins, and the last writer owns a
shared prefix node (a recency heuristic: the most recent chain through a
block is the one most likely still warm). size now counts valued nodes,
which is the intended meaning.

Updated radixtree tests to the new semantics: deepest-prefix test uses
non-overlapping chains, a new test asserts last-writer-owns-shared-node,
Evict/Remove/MaxEntries expectations recomputed for per-prefix-node
counting, and a shared-prefix LongestMatch red test added. Added a
prefixcache Decide test proving a prefix-only query routes to the warm
node. No prefixcache .go logic changed.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(distributed): lock in prefix-cache routing behavior end to end

Add a DB-backed e2e spec that drives SmartRouter against a real
NodeRegistry (Postgres testcontainer) and the real prefixcache.Index
radix-tree provider, using a fake gRPC backend factory so no real
inference runs. Covers the five behaviors validated by hand:

1. Cold miss + observe: an unseen prefix chain cold-places and is recorded.
2. Hot-match affinity: the same chain returns to its warm node X.
3. Shared-prefix match: a divergent chain sharing X's leading prefix
   still routes to X (the radix-tree regression we fixed).
4. Negative control: an unrelated chain is a cold miss, not a false
   hot match on X.
5. Failover + invalidation: removing X's replica fires the registry
   chokepoint hook to invalidate the prefix entry, and the chain fails
   over to surviving node Y and re-homes there.

Replaces the need for manual docker-compose re-runs.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(prefixcache): make prefix-cache affinity replica-granular

Track prefix-cache affinity per loaded replica (a backend process with its
own KV cache) instead of per node, so multiple replicas of the same model on
one node each keep distinct affinity and a hot prefix routes back to the exact
replica that served it.

- radixtree: add RemoveFunc(pred) and reimplement Remove on top of it.
- prefixcache: introduce ReplicaKey{NodeID, Replica}; Index/Candidate/
  PrefixDecision/Select/Provider now key on ReplicaKey. Add InvalidateNode to
  drop every replica of a node; Invalidate drops one replica. Select returns
  (ReplicaKey, bool) and gains a deterministic least-in-flight eligible
  fallback (tiebreak NodeID then Replica).
- messaging: carry Replica on PrefixCacheObserveEvent and
  PrefixCacheInvalidateEvent (Replica < 0 means all replicas of the node).
- Sync delegates + broadcasts with replica; InvalidateNode broadcasts
  Replica=-1; ApplyInvalidate routes negative replica to InvalidateNode.

This is part 1 of 2; the registry/router/wiring consumers are updated
separately.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): make prefix-cache routing replica-granular

Wire the SmartRouter, NodeRegistry, and distributed startup to the
replica-keyed prefixcache API. Affinity is now tracked per replica
(each replica is a separate process with its own KV cache), so a prefix
served by (node,0) no longer leaks onto the same-node sibling (node,1).

- RoutePreference gains PreferredReplica; FindAndLockNodeWithModel locks
  the EXACT (node_id, replica_index) row, falling through to the default
  ORDER BY when that replica is not loaded.
- SetReplicaRemovedHook now carries replicaIndex; RemoveNodeModel fires
  the specific replica, RemoveAllNodeModelReplicas and the four bulk
  node-scoped deletes fire replica<0 (all replicas of the node).
- buildPreference builds one Candidate per loaded replica and locks the
  exact replica the policy chose; observePrefix records the served
  ReplicaKey at every call site.
- distributed.go routes the hook to InvalidateNode (replica<0) or
  Invalidate(key).
- Tests updated to the replica-keyed API plus new coverage: a hot prefix
  on (node,0) prefers replica 0 over the same-node sibling (router unit +
  e2e), and FindAndLock locks the exact preferred replica.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(distributed): derive prefix chain from messages for tokenizer-template models

Prefix-cache-aware routing built its prompt-prefix chain from the rendered
prompt string `s` in ModelInference. For models with
TemplateConfig.UseTokenizerTemplate the frontend never renders a prompt - the
backend tokenizes the structured messages itself - so `s` is empty, the chain
is empty, and routing silently falls back to round-robin. That covers the bulk
of modern chat models (qwen3, llama3, ...), so the feature effectively never
engaged for them.

Fall back to messagesPrefixSource(messages): a deterministic, prefix-stable
head-first serialization of the conversation (role + content per turn). Two
requests sharing a leading system prompt and early turns share a leading byte
prefix, which ExtractChain maps to a shared chain prefix - landing both on the
same cache-warm replica. The rendered `s` is still preferred when present
(higher fidelity for non-template models).

Found via the multi-replica-per-node e2e: zero "prefix-cache routing decision"
logs despite per-request Route calls, traced to the empty-chain guard.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): document prefix-cache routing roadmap

Add a routing-and-caching roadmap section to the distributed-mode guide,
linking the epic (#10063) and the follow-up issues (#10064-#10070) surfaced
from a survey of SGLang, vLLM production-stack, Ray Serve, llm-d, AIBrix, and
NVIDIA Dynamo.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-30 23:24:22 +02:00
Richard Palethorpe
12d1f3a697 security(http): refuse redirects on outbound clients via hardened pkg/httpclient (#10087)
LocalAI's outbound HTTP clients used Go's default redirect policy, which
follows up to 10 redirects. On a cross-host redirect Go forwards custom
request headers — including credential headers such as Anthropic's
x-api-key — to the redirect target (Go strips Authorization, Cookie and
WWW-Authenticate cross-host, but NOT arbitrary custom headers). An
attacker able to elicit a redirect from an upstream (a hijacked or
spoofed upstream, DNS trickery, or a malicious upstream_url) then
harvests the operator's provider API key.

This was first reported against the cloud-proxy / MITM PII path
(GHSA-3mj3-57v2-4636); the same class affects every other outbound
client. Rather than patch each call site, add pkg/httpclient as the one
sanctioned constructor for outbound HTTP and route everything through it.

pkg/httpclient:
  - New(...)             refuses redirects, TLS 1.2 floor, no body
                         deadline (streaming/SSE safe)
  - NewWithTimeout(d)    simple request/response calls
  - WithFollowRedirects  opt-in following that still strips credential
                         headers on any cross-host hop; different
                         scheme/host/port == different origin, guarding
                         the curl CVE-2022-27774 port-confusion class
  - WithTransport(rt)    keep a custom transport (IP-pin, HTTP/2, a
                         credential-injecting RoundTripper)
  - HardenedTransport()  base transport with the TLS floor + bounded setup
  - Harden(c)            apply the policy to a library-supplied *http.Client
  - NoRedirect           the CheckRedirect policy; wraps ErrRedirectBlocked

Lint: a forbidigo rule flags http.DefaultClient and http.Get/Post/
PostForm/Head, pointing at pkg/httpclient (.golangci.yml,
.agents/coding-style.md). forbidigo cannot match the &http.Client{}
composite literal without also flagging legitimate *http.Client type
references, so that form is enforced by review.

Migrates every non-test outbound call site across core/, pkg/, cmd/, and
the Go backend (backend/go/cloud-proxy). Credential-bearing and
internal-RPC clients refuse redirects; download / CDN / registry clients
use WithFollowRedirects so they keep working while stripping secrets
cross-host. The only credential-bearing client that follows redirects is
the gated-download path (pkg/downloader/uri.go), which strips the token
on the cross-host hop to the CDN. Hardening this closes, in passing:
  - MCP remote-server bearer token leaking via a redirect (the
    RoundTripper re-injected Authorization on every hop)
  - agent multimedia/webhook clients leaking user-supplied auth headers
  - cors_proxy following redirects, bypassing its SSRF IP-pin
  - downloader's authorized read path leaking the token cross-host

Fixes: GHSA-3mj3-57v2-4636 (cloud-proxy leaks operator provider API key
(x-api-key) to attacker host on cross-host redirect)
Reported-by: tonghuaroot
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-30 12:04:10 +02:00
LocalAI [bot]
4a2cc64d07 feat(reasoning): honor per-request reasoning_effort on chat completions (#10082)
The OpenAI `reasoning_effort` field only reached the prompt template; it
never toggled the backend's thinking. Map it onto
ReasoningConfig.DisableReasoning (which becomes the enable_thinking gRPC
metadata) in the request merge, so reasoning_effort="none" disables
reasoning per request: the use case from #10072 (run a single Qwen3-style
model and turn reasoning off for low-latency tasks while keeping it on
for others).

Effort levels (minimal/low/medium/high) enable thinking unless the model
config explicitly disabled it (reasoning.disable: true wins and is never
re-enabled by a request); "none" always disables.

Closes #10072


Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-29 22:09:07 +00:00
Ching
a473a32678 test(react-ui): cover models gallery empty-state reset flow (#10019)
Exercise the filtered empty-state path in the models gallery and verify
that the clear-filters action restores the list and resets the filter
selection.

Assisted-by: Codex:gpt-5

Signed-off-by: Ching Kao <0980124jim@gmail.com>
2026-05-29 10:39:33 +00:00
Richard Palethorpe
fbcd886a47 fix(application): stop backend processes synchronously on shutdown (#10058)
application.New wires a fire-and-forget goroutine that runs
StopAllGRPC + distributed.Shutdown when the app context is cancelled.
Callers (tests, CLI signal handler) cancel the context and then exit
immediately, so the test binary / process can terminate before that
goroutine kills the spawned backend children. go-processmanager sets no
Pdeathsig, so the orphans are reparented to init and survive — leaving
dozens of stray mock-backend processes after an e2e run.

Add Application.Shutdown(), which runs the same cleanup synchronously on
the caller's stack and is idempotent via sync.Once. The context-cancel
goroutine, the CLI signal handler, and the test suites all call it, so
cleanup is deterministic and the duplicated teardown logic collapses to
one place. The async goroutine remains as a safety net for callers that
forget; sync.Once dedupes the double call.

Wire e2e_suite_test and the two mock-backend Contexts in app_test to
call Shutdown in their AfterSuite/AfterEach.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-29 11:40:43 +02:00
泊舟
e1a782b70f fix(openai): stop streaming tool-call double-emission when autoparser is active (#10055)
Streaming /v1/chat/completions could emit the same logical tool call at
multiple `index` values. In processStreamWithTools the Go-side iterative
parser (ParseXMLIterative / ParseJSONIterative) runs on every token and
emits tool-call deltas, while the C++ chat-template autoparser delivers
its own tool calls via ChatDeltas that are flushed at end-of-stream by
ToolCallsFromChatDeltas -> buildDeferredToolCallChunks. With both paths
active the same call is emitted twice at different indices, so OpenAI
clients that accumulate tool calls by `index` dispatch the tool N times.

Skip the Go-side iterative parser once the autoparser is producing tool
calls (hasChatDeltaToolCalls). The deferred flush stays guarded by
lastEmittedCount, so the race where the Go parser emitted before the flag
flipped also remains single-emission. Backends without an autoparser
(e.g. vLLM) keep hasChatDeltaToolCalls=false and are unaffected.

Refs #9722

Signed-off-by: bozhouDev <259759010+bozhouDev@users.noreply.github.com>
Co-authored-by: bozhouDev <259759010+bozhouDev@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-29 11:39:09 +02:00
Richard Palethorpe
b81a6d01b3 perf(react-ui): code-split bundle, speed up coverage suite (#10042)
* Curate the highlight.js build to ~29 languages (lib/core + the
  common set) instead of the full ~190-grammar default: -787 KB raw /
  -230 KB gz on the base bundle.
* Code-split every route via React.lazy with a per-layout <Suspense>
  in App.jsx so the sidebar stays mounted on navigation. Initial entry
  chunk drops from 3194 KB raw / 887 KB gz to 397 KB / 122 KB (-87%).
  Warm chunks on sidebar hover/focus/touch via a preload registry so
  the click finds the chunk already in flight or cached.
* Migrate Playwright coverage from istanbul (build-time counters) to
  native Chromium V8 coverage, with per-worker accumulation +
  conversion. Suite drops from 71s to 30s at 20 workers (~58%) at the
  non-instrumented floor.
* Keep the coverage gate bundling-invariant: the coverage build inlines
  dynamic imports so every shipped source file lands in the denominator
  (otherwise untested page chunks silently drop out and inflate the
  percentage). Production builds stay code-split.
* Add UI_TEST_WORKERS=N Makefile knob; tighten coverage tolerance to
  0.8pp now that jitter sits near istanbul's ~0.5pp again.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-28 13:43:15 +02:00
Tai An
0fd666ee6e fix(openresponses): populate Content and accept bare {role,content} items (#10039) (#10040)
* fix(openresponses): populate Content and accept bare {role,content} items (#10039)

Fixes mudler/LocalAI#10039 — `/v1/responses` silently returned empty
output on any model whose YAML doesn't include a Go-side
`template.chat_message` block.

Three cooperating bugs:

* `convertORInputToMessages` populated only `StringContent` for string
  input and for the `input.Instructions` system message, leaving the
  `Content` (any) field nil.
* `TemplateMessages` gated all fallback content-rendering branches on
  `Content != nil && StringContent != ""` — but every branch in that
  function consumes `StringContent`, not `Content`. The `&&` silently
  dropped messages that had StringContent set and Content nil, producing
  an empty prompt that the 5× empty-retry guard then turned into a
  200 OK with `output: []`.
* The array-input branch of `convertORInputToMessages` dispatched on
  `itemMap["type"]` with no default, dropping bare `{role, content}`
  items emitted by the OpenAI Python SDK helper
  `client.responses.create(input=[{...}])`.

Fix:

* Set both `Content` and `StringContent` in the two openresponses
  message-construction sites that only set one.
* Treat a bare `{role, content}` item (no `type`) as
  `type: "message"` for OpenAI-SDK compatibility.
* Gate `TemplateMessages` fallback rendering on `StringContent != ""`,
  which is what every downstream branch in that function actually
  reads.

Regression test added to `evaluator_test.go` covering the fallback
path (no `ChatMessage` template) with a StringContent-only message,
both with and without a role mapping.

* test(openresponses): guard Content population and ToProto path (#10039)

Add regression tests for the two seams the original fix touched but
left uncovered:

* convertORInputToMessages must populate both Content and StringContent
  for plain string input and for bare {role, content} array items (the
  OpenAI SDK shape that omits the type discriminator). Both are
  functional reds against the pre-fix code.
* Messages.ToProto reads Content, not StringContent — this is the path
  UseTokenizerTemplate backends (imported GGUFs) take. The cases pin
  that contract so a future regression on the producer side is caught.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-28 07:21:48 +00:00
LocalAI [bot]
373dc44992 fix(react-ui): force .check() on hidden Toggle input in fits-filter e2e (#10031)
* fix(react-ui): force .check() on hidden Toggle input in fits-filter e2e

The polish PR (#10030) swapped the raw <input type=checkbox> for the
shared <Toggle> component, which visually hides its native input via
opacity:0;width:0;height:0. Playwright's .check() waits for visibility
before clicking and times out after 30 s, breaking two UI E2E tests:

  - enabling fits filter hides models that exceed available VRAM
  - fits filter state persists after reload

Pass { force: true } to skip the visibility check; the input is still
the real focusable checkbox and toggles state on click. The companion
.toBeChecked() assertion only reads state and works unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

* fix(react-ui): click visible Toggle track in fits-filter e2e

force:true skips the actionability checks but not the viewport check,
and the Toggle's hidden input has width:0;height:0 so Playwright still
reports "Element is outside of the viewport". Click the visible
.toggle__track inside the filter-bar-group__toggle wrapper instead —
that's what a real user clicks, and label-input association toggles
the wrapped checkbox naturally.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-27 22:41:01 +02:00
LocalAI [bot]
02a0e70396 fix(react-ui): polish 'Fits in my GPU' filter to use design-system Toggle (#10030)
* fix(react-ui): polish 'Fits in my GPU' filter to use design-system Toggle

The recently added VRAM-fit filter in the Models page used a raw
<input type="checkbox"> next to the themed range slider, breaking the
visual language of the rest of the row. Swap it for the shared
<Toggle> component (already used by Backends, Settings, Traces,
AgentCreate), adopt the filter-bar-group__toggle class to drop the
duplicated inline styles, add a fa-microchip icon to mirror the
per-row fit indicator, and add a subtle left divider so the filter
reads as separate from the context-size slider on its left.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(react-ui): move 'Fits in GPU' filter to filter row and unify copy

Two follow-ups on the previous polish pass:

1. Move the toggle from the context-slider row into the filter-button
   row above. The toggle is a filter on the result set, not a config
   for VRAM estimation, so it belongs with the type chips and backend
   select. The context slider stays its own thing.

2. Unify the label copy. The same locale file had "Fits in my GPU"
   for the filter and "Fits in GPU" for the per-row indicator; pick
   the shorter, possessive-free variant everywhere (en/de/es/it/zh-CN).
   Update e2e selectors to match.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-27 21:09:14 +02:00
LocalAI [bot]
893e69cbf8 fix(react-ui): share single /api/operations poller across consumers (#10029)
useOperations() spun up its own setInterval per hook instance, so on
pages like /app/models the OperationsBar in App.jsx plus the page's
own useOperations() call each polled /api/operations at 1 Hz - 2 RPS
sustained for the whole session, repeated on Backends and Chat.

Lift the poller into an OperationsProvider mounted under AuthProvider
so all consumers (OperationsBar, Models, Backends, Chat) share one
timer. The hook file re-exports from the context to keep call sites
unchanged.


Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-27 16:39:09 +02:00
Siddharth More
c9a1a7e6a0 UI: add 'Fits in my GPU' filter on Install Models (#10017)
* feat(ui): add GPU fit filter on models install page

* Delete docs/vram-fits-filter-backend-optionals.md

Signed-off-by: Siddharth More <siddimore@gmail.com>

---------

Signed-off-by: Siddharth More <siddimore@gmail.com>
2026-05-27 15:17:44 +02:00
Richard Palethorpe
8d70855ea6 test: add Go + React UI coverage gates and fill test gaps (#9989)
- Strict monotonic Go coverage gate (make test-coverage-check, 45% baseline)
  run in CI; fixes ginkgo dropping all-but-one coverprofile across multiple
  recursive roots, builds with -tags auth, and folds in the in-process
  tests/e2e suite via --coverpkg.
- React UI e2e coverage (make test-ui-coverage: vite-plugin-istanbul + nyc,
  nix-provided Chromium) plus e2e specs for 6 previously-untested pages, and a
  UI coverage gate (make test-ui-coverage-check) with a small tolerance since
  e2e line coverage jitters ~0.5pp run-to-run.
- pre-commit hook: lint + coverage on Go changes, Playwright e2e + UI coverage
  gate on react-ui changes; install with make install-hooks.
- New Go handler tests (settings, branding), hermetic base64 download test.
- fix(ui): model editor reads vram_display (snake_case), so the VRAM estimate
  renders again; covered by a regression test.

Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-26 22:06:10 +02:00
LocalAI [bot]
e4c70fca7a fix(streaming/tools): don't leak prefill-misclassified content as trailing reasoning chunk (#10000)
When the C++ autoparser is in pure-content fallback mode (qwen3-4b after
model emits a tool-call JSON in non-thinking mode, the streaming worker
ended the SSE stream with a spurious

    data: {...,"delta":{"reasoning":"{\"name\":\"exec\",\"arguments\":...}"}}

chunk carrying the same JSON that was already in delta.tool_calls.

The Go-side ReasoningExtractor is configured from
DetectThinkingStartToken, which scans the model's jinja chat template
verbatim and finds <think> inside an {% if enable_thinking %} block
without evaluating the conditional. Every output chunk then runs through
PrependThinkingTokenIfNeeded, which synthesizes a <think> in front and
makes ExtractReasoning treat everything after as reasoning. The autoparser
correctly classifies zero reasoning (qwen3's tool format isn't on
llama.cpp's recognized-tool list, so all tokens land in
ChatDelta.Content), but processStreamWithTools then preferred
extractor.Reasoning() over functions.ReasoningFromChatDeltas at the
end-of-stream flush — handing the polluted Go-side state to
buildDeferredToolCallChunks, which emitted it as a trailing reasoning
chunk.

Two changes:

* Add a sticky preferAutoparser flag to processStreamWithTools, mirroring
  the analogous flag in processStream from #9985. Once any ChatDelta
  carries content or reasoning, the flag stays on for the rest of the
  stream and the worker stops falling back to the Go-side extractor for
  per-token deltas. This avoids the per-chunk leak path and the cumulative
  pollution.

* Extract chooseDeferredReasoning, a small helper that selects the
  end-of-stream reasoning source. When preferAutoparser is set, return
  functions.ReasoningFromChatDeltas(chatDeltas); otherwise fall back to
  extractor.Reasoning() (the correct source for vLLM and other backends
  with no autoparser).

The helper has a focused test suite covering both sides of the contract:
autoparser-active with empty reasoning (the qwen3 case — the fix's
purpose), autoparser-active with real reasoning_content
(jinja-with-recognized-format models), and autoparser-not-active with
genuine Go-side reasoning (vLLM-style backends).

E2E with combined #9988 and this fix on qwen3-4b post-#9985 gallery
shape: 18 content chunks of the tool-call JSON, 1 tool_call chunk with
name='exec' and the right arguments, finish_reason=tool_calls, and zero
reasoning chunks — down from one polluted reasoning chunk before this
fix.

Depends on #9999 (the streaming JSON tool-call gating bug for qwen3) to
make the trailing chunk observable end-to-end; the helper unit tests are
independent.

Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-26 08:34:26 +02:00
LocalAI [bot]
f17d99f6e5 fix(streaming/tools): stop healing-marker stubs from gating off content (#9999)
* fix(streaming/tools): stop healing-marker stubs from gating off content

When the C++ autoparser is in pure-content fallback mode (e.g. qwen3
without --jinja) and the model emits a tool call as JSON, the streaming
worker calls ParseJSONIterative on each new chunk. parseJSONWithStack
heals partial input like `{` into `{"<marker>":1}` where <marker> is a
random integer. removeHealingMarkerFromJSON only stripped the marker
from values, so the synthetic key survived and downstream callers saw
a stub object with a random-looking key.

chat_stream_workers.go's JSON tool-call detector then bumped
lastEmittedCount past the stub even though no real tool call was
emitted, gating off ALL subsequent content chunks. The qwen3 + tools +
streaming case ended up dribbling only the first `{"` to clients and
then nothing, even when the model went on to call the noAction
`answer({"message": "…"})` pseudo-tool.

Three changes, each with its own regression test:

* removeHealingMarkerFromJSON now strips the marker suffix from keys
  too, dropping the entry when the truncated key is empty. Inputs like
  `{` no longer leak `{"<marker>":1}` to callers; partial keys like
  `{ "code` still preserve the model-typed prefix `code`.

* ParseJSONIterative skips empty-after-healing maps so a healed `{`
  doesn't surface as a stub result.

* The streaming JSON detector now breaks (not continues) on entries
  without a usable `name`, and only bumps lastEmittedCount past
  successfully-emitted entries. Defense-in-depth against any future
  partial-parse shape.

The parser tests cover eight partial-JSON-prefix shapes and verify no
marker characters leak into keys, plus the two early shapes (`{`,
`{"`) that should not surface a stub at all.

Fixes #9988

Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(streaming/tools): cover the autoparser-correctly-working path

Extract the JSON tool-call streaming emit loop into emitJSONToolCallDeltas
and unit-test it against every shape that can hit the streaming worker:

* the bug case — a healing-marker stub at index 0 must NOT bump
  lastEmittedCount, so subsequent content chunks keep flowing;
* the autoparser-correctly-working case — empty jsonResults (because
  the C++ autoparser cleared the raw text and delivers tool calls via
  TokenUsage.ChatDeltas) is a no-op, leaving the deferred end-of-stream
  emitter to ship the autoparser's tool calls;
* a single complete tool call — emit one chunk, advance to 1;
* arguments arriving as a JSON-string vs as a nested object — both
  serialize to the wire as JSON-string arguments;
* multiple parallel tool calls — one chunk each;
* a real tool call followed by a partial stub — emit the real one,
  stop at the stub, resume on a later chunk once the stub completes.

Locks down the no-regression guarantee the user asked for: this PR's
fix is scoped to the pure-content fallback path; when the autoparser
actually classifies tool calls (jinja-recognized chat format with tool
support), the helper is a no-op and nothing changes.

Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-25 23:55:35 +02:00
LocalAI [bot]
1c6c3adad6 fix(reasoning): stop <think> leaking into content when autoparser is in pure-content mode (#9991)
When LocalAI templates a thinking model outside of jinja (the default for
the qwen3 gallery family), llama.cpp's chat parser falls back to a
"pure content" PEG parser that dumps the entire raw response into
ChatDelta.Content with an empty ReasoningContent. The Go side then
trusted that content verbatim and overrode tokenCallback's
correctly-split reasoning, so <think>...</think> blocks ended up in the
OpenAI `content` field. Regression from v4.0.0 introduced when the
autoparser ChatDeltas path was added (#9224).

The override now runs Go-side reasoning extraction defensively when the
autoparser delivered content but no reasoning. The streaming worker
gains a sticky preferAutoparser flag that flips on the first chunk
where the autoparser classified reasoning_content; until then we use
the streaming Go-side extractor. Realtime mirrors the non-streaming
fallback. When the autoparser already populated ReasoningContent we
trust it untouched, so jinja-enabled installs are not regressed.

gallery/qwen3.yaml now enables use_jinja, letting the autoparser
classify <think> natively for all 20+ qwen3 family entries that share
this template.

Fixes #9985

Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-25 22:39:50 +02:00
LocalAI [bot]
8d6548c0b9 fix(distributed): sync gallery OpCache + caches across frontend replicas (#9983)
When the LocalAI frontend deployment is scaled past one replica, the UI's
/api/operations poll round-robins between pods. Each pod kept the OpCache
(galleryID->jobID), OpStatus map, and the post-install in-memory caches
(ModelConfigLoader, UpgradeChecker) purely in-process. Reads never
consulted PostgreSQL or NATS even though writes already published to PG.
Symptoms:

- A user installing a model on replica A saw the operation card flicker
  in and out as the load balancer alternated.
- The Models page re-fetched the whole gallery on every flicker because
  useEffect([operations.length]) re-fires when the count changes.
- A chat completion that landed on replica B after the install completed
  on replica A failed to find the new model — B's ModelConfigLoader was
  still the old one because nothing told it to reload from disk.
- The UpgradeChecker 6-hour cache stayed stale on peer replicas after a
  backend upgrade, so /api/backends/upgrades kept surfacing an upgrade
  that had already shipped.

Mirror the jobs Dispatcher pattern for gallery ops:

- OpCache learns SetMessagingClient/SetGalleryStore + a Start(ctx) that
  hydrates from PostgreSQL and subscribes to gallery.opcache.{start,end}.
  Set/SetBackend now upsert cache_key + is_backend_op on the gallery_
  operations row and broadcast OpCacheEvent so peers merge it in. The
  hydrate path uses a new GalleryStore.ListActive() (status in {pending,
  downloading, processing} and updated within 30 min).
- GalleryService.SubscribeBroadcasts wires a SubjectGalleryProgress-
  Wildcard subscriber that calls a new lock-light mergeStatus into the
  local statuses map, plus a SubjectGalleryCancelWildcard subscriber that
  runs the locally-registered cancel func. Hydrate() restores active rows
  from PostgreSQL on startup so a freshly-started replica is not
  observably empty mid-install. CancelOperation tolerates the cancel func
  living on a different replica and publishes anyway.
- modelHandler and backendHandler publish on the new
  SubjectCacheInvalidateModels / SubjectCacheInvalidateBackends after
  a successful install/delete/upgrade. SubscribeBroadcasts wires peers
  to refresh: OnModelsChanged (re-runs LoadModelConfigsFromPath) and
  OnBackendOpCompleted (re-triggers UpgradeChecker). The originating
  replica reloads inline so it never enters the broadcast handler.
- OpStatus.Error (an error interface) flat-marshalled to "{}" over JSON,
  so a failed install replicated to a peer arrived with a nil error and
  the UI's failure banner never appeared. Add MarshalJSON/UnmarshalJSON
  via an opStatusWire shim that round-trips Error as a string.
- UpdateStatus and CancelOperation now drop the mutex before publishing
  to NATS or persisting to PostgreSQL. The wildcard subscriber's
  mergeStatus loops back into the same service on the publishing replica
  and would deadlock otherwise; this also prevents future PG round-trips
  from stalling concurrent readers on every progress tick.

Tests cover the OpStatus error round-trip, OpCache propagation through a
shared in-memory bus, OpCache PostgreSQL hydration (active-only),
GalleryService progress + cancel broadcast, Nodes preservation across a
peer's bare progress tick, GalleryService hydration from PG, and the
two cache-invalidation broadcasts (models + backends). 44 specs total
in galleryop; routes/operations specs and jobs/agents suites still pass.


Assisted-by: claude-code:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-25 17:28:14 +02:00
LocalAI [bot]
06e777b75e feat(distributed): gated X-LocalAI-Node response header (middleware + wrapper) (#9976)
* feat(distributed): add per-request node ID context holder

Introduce pkg/distributedhdr, a leaf package carrying a per-request
*atomic.Value holder for the picked worker node ID from the
SmartRouter (core/services/nodes) up to the HTTP response writer
wrapper (core/http/middleware). Avoids the import cycle that a shared
key in either consumer would create.

Exposes NewHolder, WithHolder, Holder, Stamp, Load, Inherit. The
holder is atomic.Value so cross-goroutine publish from the router to
the response writer wrapper is race-clean.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): add ExposeNodeHeader middleware + response writer wrapper

New ApplicationConfig.ExposeNodeHeader bool + --expose-node-header CLI
flag / LOCALAI_EXPOSE_NODE_HEADER env var (default off; the node ID
reveals internal topology and is opt-in).

The middleware creates a per-request *atomic.Value holder, attaches it
to c.Request().Context() via distributedhdr.WithHolder, and wraps
c.Response().Writer with a custom http.ResponseWriter that sets the
X-LocalAI-Node header on first Write / WriteHeader / Flush by reading
the holder. Implements http.Flusher, http.Hijacker, Unwrap so it
composes cleanly with Echo and http.NewResponseController.

request.go propagates the holder onto derived contexts via
distributedhdr.Inherit so the holder survives the correlation-ID
context replacement.

Unit + race-clean concurrency + integration specs.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): stamp node ID in router and wire middleware to inference routes

ModelRouterAdapter.Route stamps the picked node ID into the
per-request holder via distributedhdr.Stamp(ctx, result.Node.ID) right
after replica selection.

Wire ExposeNodeHeader middleware to:
- OpenAI chat/completion/embeddings + audio transcriptions/speech + image generations/inpainting
- Anthropic /v1/messages
- Ollama /api/chat, /api/generate, /api/embed, /api/embeddings
- Jina /v1/rerank
- LocalAI /v1/vad

The middleware's wrapper reads the holder on first byte and sets the
X-LocalAI-Node response header before delegating to the underlying
writer. Per-request scope means no race under concurrent multi-replica
routing.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(distributed): thread request context through backend Load + cover ctx propagation

Five non-OpenAI backend helpers were silently using app.Context instead
of the request context for the gRPC backend call: transcription, TTS,
image generation, rerank, VAD. Effect: distributedhdr.Stamp in the
router callback was a silent no-op for these paths, AND client
cancellation didn't propagate to in-flight inference.

Thread c.Request().Context() (or the equivalent input.Context after
the request middleware has installed the correlation-ID derived
context) through each helper and into ModelOptions via
model.WithContext(ctx). ImageGeneration's signature gains a leading
ctx parameter; in-tree callers (openai image, openai inpainting,
openai inpainting_test) are updated to match.

ModelEmbedding gains a leading ctx parameter for the same reason; the
openai and ollama embedding handlers pass the request context through.

chat_stream_workers.go defers the initial role=assistant chunk
emission until the first token callback so the wrapper's lazy
X-LocalAI-Node lookup against the loader runs AFTER ml.Load has
stamped the per-modelID node ID; semantically identical for clients
(role still arrives before any text).

Regression test core/backend/ctx_propagation_test.go pins ctx
propagation for all five helpers.

Docs updated to enumerate the full endpoint coverage of the
--expose-node-header flag.

Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-25 10:51:48 +02:00
Richard Palethorpe
6a80e23733 feat(middleware): Model routing, PII filtering, Cloud model proxies (#9802)
Add a routing middleware stack and a cloud-proxy backend.

* cloud-proxy: a Go gRPC backend that forwards OpenAI- and
  Anthropic-shaped chat requests to upstream providers, with an
  optional translate mode (OpenAI request -> Anthropic /v1/messages
  -> OpenAI response) and full tool-calling support.

* routing: admission control, content-aware model routing
  (embedding cache + classifier + rerank + Arch-Router score),
  PII detection/redaction (regex + NER) with streaming filter and
  OpenAI/Anthropic adapters, and a per-user/per-key billing recorder
  backed by GORM or in-memory storage.

* middleware: UsageMiddleware records usage via the billing recorder,
  plus admission, route-model, usage-stamp and trace middlewares.

* observability: BackendTrace ring buffer stores full request bodies
  (capped), MITM proxy emits structured trace events, and router
  classifier decisions surface at /api/router/decide.

* gallery: Arch-Router-1.5B (Q4_K_M and Q8_0).

* UI: cloud-proxy model-editor fields, classifier system-prompt and
  score-normalization config, and a Traces page rendering request
  bodies.

Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-05-25 09:28:27 +02:00
LocalAI [bot]
1198d10b58 fix(traces): cap backend trace Data to keep admin UI responsive (#9960)
* fix(traces): cap backend trace Data field so the admin UI stays responsive

The previous fix (#9946) capped API trace bodies but missed backend traces,
which carry the same blast radius:

  - LLM backend traces store the full chat messages JSON, full response, and
    full streaming deltas. Every agent-pool reasoning step ships the full
    RAG-augmented history (50-500 KiB per trace, often 100+ traces queued).
  - TTS / audio_transform / transcript traces embed a 30s audio snippet as
    base64, around 1.3 MiB per trace.

Both blow the /api/backend-traces JSON past tens of MiB. The admin Traces
page then keeps re-downloading and re-parsing the buffer faster than the
5s auto-refresh and stays in the loading state forever, the same symptom
the API-side fix addressed.

Apply two complementary caps, both honoring LOCALAI_TRACING_MAX_BODY_BYTES:

Option A (safety net in core/trace): RecordBackendTrace walks the Data map
recursively and replaces any string value larger than the cap with
"<truncated: N bytes>". Catches anything a future producer forgets.

Option B (head-preserving at the producer):
  - core/backend/llm.go: TruncateToBytes on messages, response, and
    chat_deltas content/reasoning_content so the leading content stays
    readable in the UI.
  - core/trace/audio_snippet.go: omit audio_wav_base64 when the encoded
    blob would exceed the cap (truncated base64 is undecodable). The
    quality metrics still ship and the UI's WaveformPlayer simply skips
    when the field is absent.

TruncateToBytes is bounded to <= maxBytes so Option A leaves the producer's
head-preserving output alone instead of replacing it with the bare marker.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

* fix(react-ui): expose tracing_max_body_bytes in Settings and Traces panels

The setting was already plumbed through env (LOCALAI_TRACING_MAX_BODY_BYTES),
CLI flag, and the runtime_settings.json GET/PUT schema, but neither the main
Settings page nor the inline Traces panel offered an input for it. Admins
hitting the "Traces UI stuck loading" symptom had to know to set an env var
or PUT raw JSON to /api/settings to dial the cap.

Add a "Max Body Bytes" row next to "Max Items" in both places. Same input
type, same disabled-when-tracing-off semantics, placeholder shows the 65536
default so users see what they're inheriting.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

* test(react-ui): disambiguate Max Items locator after adding Max Body Bytes

The Tracing settings panel now has two number inputs. The previous spec
matched 'input[type="number"]' which became ambiguous and triggered a
Playwright strict-mode violation in CI. Switch to getByPlaceholder('100')
for Max Items and add a parallel spec for the new Max Body Bytes field
using getByPlaceholder('65536').

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-23 14:50:40 +02:00
LocalAI [bot]
a0f3e26245 fix(distributed): make admin backend installs resilient and observable (#9958)
* feat(distributed): add configurable NATS backend install/upgrade timeouts

Adds BackendInstallTimeout and BackendUpgradeTimeout to DistributedConfig
with 15m defaults, following the existing MCPToolTimeout / WorkerWaitTimeout
pattern. These will replace the hardcoded literals in RemoteUnloaderAdapter
so admin-driven backend installs across the cluster survive long OCI image
pulls that previously timed out at 3m.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* style(distributed): gofmt alignment after timeout fields

Re-aligns the Validate() negative-duration map and the Default* const
block so the new BackendInstall/UpgradeTimeout entries do not leave
the surrounding columns mis-padded.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(cli): surface LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT and _UPGRADE_TIMEOUT

Parses the two new env vars on the run CLI and threads them through the
existing AppOption builder so DistributedConfig picks them up. Invalid
duration strings now fail loudly at startup rather than silently falling
back to the default.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): inject NATS install/upgrade timeouts into RemoteUnloaderAdapter

Removes the hardcoded 3m / 15m literals from RemoteUnloaderAdapter and
threads in DistributedConfig.BackendInstallTimeoutOrDefault() and
BackendUpgradeTimeoutOrDefault() at construction. Install now defaults
to 15m (was 3m); cold OCI image pulls on Jetson Wi-Fi routinely blew
past the old ceiling. Scripted messaging client captures the timeout
so tests can assert the configured value actually reaches the NATS
request.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): introduce galleryop.ErrWorkerStillInstalling sentinel

When the NATS request-reply for backend.install (or .upgrade) times out
the worker is almost always still pulling the OCI image. Wrap the timeout
in a typed sentinel so the manager above can distinguish "worker hung"
from "worker still working" and leave the pending_backend_ops row in
place for the reconciler to confirm via backend.list.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): treat NATS install timeout as in-progress, not failure

When a worker times out replying to backend.install but the install is
still running on the worker, enqueueAndDrainBackendOp now reports a
running_on_worker status and pushes NextRetryAt out by the install
timeout so the reconciler does not immediately re-fire another install
while the worker is still pulling the image. The pending_backend_ops
row stays in place for the next reconciler pass to confirm via
backend.list.

InstallBackend wraps the result in galleryop.ErrWorkerStillInstalling
so callers can branch (galleryop renders yellow in-progress instead of
red error). UpgradeBackend uses the same wrap.

Adds RemoteUnloaderAdapter.InstallTimeout() so the manager can push
NextRetryAt by the configured timeout without reaching into a private
field, and NodeRegistry.RecordPendingBackendOpInFlight as the soft
cousin of RecordPendingBackendOpFailure.

Also includes incidental gofmt-driven struct-field alignment in
registry.go on lines unrelated to the change (touched files are
re-formatted to canonical form per project policy).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(distributed): don't increment Attempts on in-flight install timeout

An in-flight timeout (worker still pulling the OCI image) is not a
failed attempt, it's a delayed one. Incrementing Attempts let
genuinely-progressing slow installs (e.g. 30 GB CUDA images on Wi-Fi)
trip the reconciler's maxPendingBackendOpAttempts cap and dead-letter
the queue row while the worker was still legitimately working.

RecordPendingBackendOpInFlight now only updates LastError and NextRetryAt.
Also documents "running_on_worker" in the NodeOpStatus.Status enum
comment so Task 6 implementers see the full surface.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(galleryop): surface ErrWorkerStillInstalling as non-error OpStatus

When the distributed backend manager returns an error that wraps
ErrWorkerStillInstalling, backendHandler now completes the op with a
"still installing in background" message rather than marking it as a
red failure. Admin UI sees a yellow in-progress state; reconciler
confirms completion on its next pass.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(distributed): end-to-end install-timeout-then-reconcile

Wires Task 1-6 end-to-end so any seam mismatch surfaces in CI rather
than during a real cluster install. NATS times out, the queue row
stays alive with running_on_worker status, the worker eventually
reports the backend installed via backend.list, the manager surfaces
it via ListBackends.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): document LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT / _UPGRADE_TIMEOUT

Add the two new operator-tunable env vars to the Frontend Configuration
table in the distributed-mode docs. Explains the 15m default, when to
raise it (slow links pulling multi-GB OCI images), and the new
"still installing in background" admin-UI state when the round-trip
times out but the worker is still working.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): clear pending install rows when backend.list confirms

DistributedBackendManager.ListBackends now proactively clears
pending_backend_ops install rows whose (nodeID, backend) is reported
installed by backend.list. Operator UI updates immediately instead of
waiting up to installTimeout (default 15m) for the next reconciler
tick after NextRetryAt.

Only install rows are cleared; upgrade and delete intents are not
satisfied by presence in backend.list and continue to drain through
their normal reconciler paths.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(messaging): add BackendInstallProgressEvent wire type and subject

New NATS subject nodes.<nodeID>.backend.install.<opID>.progress lets the
worker publish transient progress events (file, current/total bytes,
percentage, phase) while a long-running install pulls its OCI image.
BackendInstallRequest gains an optional OpID field so the worker knows
which subject to publish on.

Transient pub/sub (not JetStream): the install reply remains ground
truth for success/failure; dropped progress events are tolerable.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* style(messaging): drop em-dash from BackendInstallProgress test comment

Per project convention (no em-dashes anywhere). Comment substance is
unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): worker publishes debounced install progress over NATS

When BackendInstallRequest.OpID is set, the worker's backend.install
handler wires a debounced publisher (250ms window) into the gallery
download callback. Each tick becomes a BackendInstallProgressEvent on
nodes.<nodeID>.backend.install.<opID>.progress; the publisher always
emits a final event on Flush so the UI sees the terminal percentage.

Old masters that do not set OpID continue to run silent installs: no
behavior change for them. Lock ordering: the publisher releases its
mutex before calling messaging.Publish so a slow network never stalls
the install loop.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): RemoteUnloaderAdapter subscribes to install progress

InstallBackend gains opID + onProgress parameters. When both are set,
the adapter subscribes to nodes.<nodeID>.backend.install.<opID>.progress
BEFORE publishing the install request, decodes each message into the
caller's onProgress callback in a goroutine (so a slow callback never
stalls the NATS reader thread), and unsubscribes after RequestJSON
returns.

When onProgress is nil OR opID is empty (the reconciler retry path),
subscription is skipped entirely - silent installs cost nothing extra.

Subscribe failure is logged at Warn and the install proceeds without
progress streaming; the NATS round-trip still owns terminal status.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): forward backend install progress into galleryop OpStatus

DistributedBackendManager.InstallBackend now passes the gallery op ID
and a progress bridge into the adapter call. Each
BackendInstallProgressEvent from the worker becomes a
galleryop.ProgressCallback tick - which the existing backendHandler
already turns into OpStatus.UpdateStatus, so the admin UI/SSE polling
sees per-byte progress for distributed installs without any UI-side
change.

UpgradeBackend is intentionally left silent for now: its wire request
(BackendUpgradeRequest) does not carry OpID, and rolling-update
fallback is the rarer path. Will be picked up in a follow-up if the
worker upgrade path also gets a progress channel.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* test(distributed): InstallBackend tolerates silent (pre-Phase-2) workers

A worker on pre-Phase-2 code never publishes progress events. The new
master subscribes optimistically; this spec pins that a silent worker
still produces a green install with no progressCb ticks. The install
reply is the source of truth for terminal state; the progress stream
is a best-effort UX enrichment.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): document install progress streaming

Note the new nodes.<nodeID>.backend.install.<opID>.progress subject and
the silent-worker compatibility behavior so operators know to expect
real-time progress and what happens on a mixed-version cluster.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): note progress-event ordering trade-off in InstallBackend

Document near the goroutine dispatch why ordering at the consumer is
best-effort, why it rarely matters in practice (worker debounce >>
goroutine jitter), and what a future hardening pass would look like
(Seq field + stale-by-seq drop). Stops the next reader from accidentally
"fixing" the goroutine pool away.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(galleryop): add NodeProgress + OpStatus.Nodes for per-node breakdown

Adds the data model the UI needs to render an expandable per-node
breakdown of a fanned-out backend install. NodeProgress carries node
identity (ID + name), per-node status (queued / running_on_worker /
success / error / downloading), the current file + bytes + percentage
from the Phase 2 progress stream, and any per-node error.

OpStatus.Nodes is the slice the /api/operations handler will surface
in a follow-up.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(galleryop): UpdateNodeProgress merges per-node ticks by NodeID

GalleryService.UpdateNodeProgress(opID, nodeID, np) merges a NodeProgress
into OpStatus.Nodes (keyed by NodeID, no duplicates) and mirrors the
latest tick into the aggregate Progress / FileName /
DownloadedFileSize / TotalFileSize fields so the legacy single-bar
OperationsBar view keeps working unchanged alongside the new per-node
breakdown.

Concurrent-safe via the existing g.Mutex.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(distributed): write per-node OpStatus entries during install fan-out

DistributedBackendManager now accepts a nodeProgressSink and feeds it
two streams:

1. enqueueAndDrainBackendOp emits a per-node terminal entry on each
   status it appends to BackendOpResult (queued, success, error,
   running_on_worker). The opID is threaded through the function so
   the sink gets the right gallery op identity.

2. The install apply closure fans each BackendInstallProgressEvent
   into the sink as a downloading entry, alongside the legacy
   progressCb path so the aggregate single-bar view stays correct.

Production wiring passes the GalleryService (which implements
UpdateNodeProgress via Task 2) as the sink. Single-node tests pass
nil. DeleteBackend and UpgradeBackend pass an empty opID so the
sink path no-ops for ops that aren't gallery-tracked the same way
as Install.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(operations): expose per-node breakdown on /api/operations

When an operation's OpStatus has Nodes entries (populated by the
Phase 4 progress sink wiring), surface them as a "nodes" array on the
/api/operations response, sorted by node_name for stable rendering.

Backward compatible: legacy clients ignore the field; ops without any
node entries (single-node mode, model installs) omit the array entirely
thanks to the empty-slice guard.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): per-node breakdown in OperationsBar

When an install op fans out to more than one worker, the operations
bar now shows a "N nodes" chevron that expands into a per-node list.
Each row carries the node's status (color-coded pill), the current
file being downloaded, byte counts, percentage, and a thin per-node
progress bar. Yellow "Worker busy" pill marks running_on_worker
status with a tooltip explaining the NATS round-trip timed out but
the worker is still installing in the background.

Backward compatible: ops without a nodes field (legacy or single-node
mode) render as before. State for expand/collapse is local to the
component, keyed by jobID/id - reload starts collapsed.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): document per-node breakdown in the operations bar

Adds a short subsection covering the expandable "N nodes" chevron in
the OperationsBar admin UI, the meaning of each status pill, and
how it relates to the /api/operations nodes array.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(galleryop): UpdateStatus preserves Nodes when caller sends none

Real-world bug surfaced by the Phase 4 multi-worker smoke test: the
nodes[] array in /api/operations flickered between a single node at a
time on a 2-worker install. Root cause: the Phase 2 progress bridge
also calls the legacy progressCb -> UpdateStatus(&OpStatus{...}) on
every tick. UpdateStatus then overwrote the entire status pointer,
wiping the Nodes slice that UpdateNodeProgress had just merged in.

Fix: in UpdateStatus, if the incoming op has an empty Nodes slice,
carry forward the previous status's Nodes before storing. Callers
that explicitly populate Nodes still win (their slice replaces the
prior one, no merge across the two code paths).

Two regression specs added pinning both directions of the contract.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(distributed): strip implementation details from user-facing docs

Trim the new install/upgrade timeout rows and the install-progress
sections to focus on what the operator sees and tunes. Drops:

- the NATS subject names and pub/sub mechanics
- "round-trip" / reconciler / backend.list jargon
- /api/operations polling cadence
- "pre-2026-05-22" version references

Reframes the breakdown text around the admin UI (Operations Bar,
chevron, status pills, "Worker busy" tooltip). Implementation context
lives in the agent notes and code comments.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(config): move DistributedConfig.Validate flag names to constants

The negative-duration check map was a wall of literal kebab-case
strings that had to stay in sync with the kong-derived CLI flag names
manually. Move them to a Flag* const block alongside the existing
Default* block so a rename of either the Go field or the CLI naming
convention forces a compile error rather than silent drift.

Sole consumer today is Validate; the constants are exported so future
operator-facing surfaces (e.g. error messages on other validation
paths) can reference them by name instead of repeating the literals.

Tests pin both the literal values (so a future "let's just rename
this" doesn't accidentally regress the CLI flag) and the negative-
duration error message for the new BackendInstall / BackendUpgrade
fields.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(distributed): extract NodeStatus and Phase enums to constants

Sweep for the same literal-string-as-identifier pattern called out on
the Validate flag names: the per-node install status enum
("queued" | "downloading" | "running_on_worker" | "success" | "error")
appeared as raw literals across managers_distributed.go (10+ sites,
including 3 separate `n.Status == "running_on_worker"` checks),
operation.go, and the test suite. Same shape for the Phase enum
("resolving" | "downloading" | "extracting" | "starting") in the
worker-side progress publisher.

Promote both to exported const blocks:

- galleryop.NodeStatus{Queued,Downloading,RunningOnWorker,Success,Error}
  shared between galleryop.NodeProgress.Status (the wire field) and
  nodes.NodeOpStatus.Status (the in-process per-node summary)
- messaging.Phase{Resolving,Downloading,Extracting,Starting}
  shared between the worker publisher and any future consumer that
  needs to switch on phase

Tests pin both the literal values (so a future "let's just rename" doesn't
silently change the JSON wire) and use the constants in setup (so the
producer side stays drift-protected). Wire-format assertions on the
/api/operations JSON output keep their literals deliberately, so the
constant value can never silently diverge from what the UI receives.

Out of scope for this PR (separate cleanup): the finetune and
quantization job-status enums have the same anti-pattern with 14+
literal sites each, but predate this PR's work.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-23 12:35:44 +02:00
LocalAI [bot]
834ecc36bf fix(react-ui): unify backend-logs entry point for distributed mode (#9949)
In distributed mode the local /api/backend-logs WebSocket has nothing
behind it (inference runs on workers), so the "View backend logs" link
in Traces (and the action in Manage when previously not hidden) dead-
ended on /app/backend-logs/<modelId>. Manage worked around it by
hiding the action; Traces still rendered the link.

Make /app/backend-logs/:modelId the single, mode-aware entry point.
A new BackendLogsRouter probes useDistributedMode and forks:

  - standalone: existing local WebSocket view (BackendLogsDetail).
  - distributed: DistributedBackendLogsResolver fans out to each node
    via nodesApi.getModels, filters by model_name, and routes:
      * 0 hits   -> empty state with a link to the Nodes page.
      * 1 hit    -> <Navigate replace> to
                    /app/node-backend-logs/<nodeId>/<modelId>,
                    preserving the ?from= deep-link timestamp.
      * N hits   -> picker listing each hosting worker (node id,
                    replica index, load state) so the operator can
                    choose which worker's logs to view.

Bare modelId in the redirect target intentionally aggregates that
node's replicas via the worker's BackendLogStore, matching the
existing per-node link pattern in Nodes.jsx.

Revert the per-caller distributed checks now that routing is
centralised: drop the hidden:distributedMode guard on Manage's
Backend logs action, and remove the prop threading in Traces so the
link is unconditional. Any future view that wants to link to backend
logs uses the same URL and gets correct behaviour in both modes.

Assisted-by: Claude:claude-opus-4-7 [Claude Code]

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-22 22:00:08 +02:00
LocalAI [bot]
61bf34ea2f fix(traces): cap captured body size to keep admin Traces UI responsive (#9946)
The trace middleware buffered the full request and response bodies for every
JSON exchange. With a chatty agent-pool RAG workload, /embeddings responses
(large vector arrays) accumulated to tens of MB in the in-memory buffer; the
admin Traces page would then download and parse 40+ MB on every load and on
every 5s auto-refresh, locking the UI in a loading state.

Add LOCALAI_TRACING_MAX_BODY_BYTES (default 64 KiB) that caps each captured
body. The full payload still flows through to the real client; only the
trace copy is bounded. Exchanges record body_truncated and original
body_bytes so the dashboard can show that truncation happened. The cap is
configurable via env, CLI, and runtime_settings.json.

Also unblock recovery: the Traces page now keeps the Clear button enabled
while loading, since "buffer too large to render" is exactly when the user
needs to clear it.


Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-22 15:29:24 +02:00
LocalAI [bot]
0b2ae3c6ca fix(openai): stream usage non-zero when tools are enabled (#9941)
* chore: ignore local .worktrees directory

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(openai): stream usage non-zero when tools are enabled

The streaming chat-completions worker for tool-bearing requests
(processTools in core/http/endpoints/openai/chat.go) never forwarded the
cumulative TokenUsage from ComputeChoices to the chunks it placed on the
responses channel. The outer streaming loop's running usage tracker
therefore stayed at the zero value, and the include_usage trailer
reported {prompt_tokens:0, completion_tokens:0, total_tokens:0} whenever
the request carried a `tools` array. Without tools, the alternative
`process` path stamps Usage on every chunk, so that path was unaffected.

Forward the final TokenUsage via a usage-only sentinel chunk (empty
Choices, populated Usage) emitted right before close(responses). The
outer loop's per-chunk Usage capture moves above the empty-Choices skip
so the sentinel updates the tracker without ever reaching the wire,
keeping the existing OpenAI spec contract (intermediate chunks carry no
`usage` field, and the deferred-final-chunk helpers remain Usage-free
per the regression test for issue #8546).

Adds streamUsageFromTokenUsage, usageSentinelChunk, and
applyChunkToUsage helpers with focused Ginkgo coverage plus a flow-level
test that mirrors the outer-loop sequence.

Fixes #9927

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4-7 [Claude Code]

* refactor(openai): return final TokenUsage from stream workers

Replace the usage-only sentinel SSE chunk introduced in the previous
commit with a plain return value. The streaming workers process and
processTools (now extracted as package-level processStream and
processStreamWithTools) return (backend.TokenUsage, error); the outer
ChatEndpoint loop reads the cumulative counts off the existing `ended`
channel (now carrying streamWorkerResult{usage, err}) and builds the
include_usage trailer from a normal Go value after the LOOP exits.

This drops the empty-Choices "skip but capture Usage" rule from the
outer loop and removes the usageSentinelChunk / applyChunkToUsage
helpers entirely. The SSE responses channel is back to a single
purpose: wire chunks only.

processStream and processStreamWithTools move into chat_stream_workers.go
so they can be exercised directly from tests. The chat_stream_usage_test.go
suite now drives the workers with a mocked backend.ModelInferenceFunc
and asserts on the returned TokenUsage. The regression coverage for
issue #9927 is therefore behavioral: reverting the fix (discarding
ComputeChoices' usage return) makes the assertions fail with concrete
count mismatches.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4-7 [Claude Code]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-22 10:13:41 +02:00
LocalAI [bot]
f0cb02afb8 feat(usage): attribute Sources rows to user accounts in admin view (#9935)
The merged feature (#9920) let admins see per-API-key and per-source
totals but did not surface which user owned each key, and lumped
every user's Web UI traffic into a single global Web UI row. This
makes the admin Sources tab properly per-user attributable:

- KeyTotal gains UserID + UserName, populated from the snapshot the
  usage middleware already records. The by_key roll-up now groups by
  (api_key_id, api_key_name, user_id, user_name).
- New SourceTotals.ByUserSource roll-up groups (source, user_id,
  user_name) for sources without a key identity (web, legacy). Only
  populated on the admin path (includeLegacy=true); the non-admin
  endpoint stays unchanged for backwards compatibility.
- SourcesTable accepts showUserColumn={isAdmin}; admin view renders
  a User column, makes the search match user name/id, and expands
  Web UI / legacy pseudo-rows from the global aggregate to one row
  per user using by_user_source.

Refs: #9862

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 23:23:06 +02:00
LocalAI [bot]
a39e025d64 fix(nodes): make per-node backend install async via gallery job queue (#9928)
* feat(galleryop): add TargetNodeID to ManagementOp for single-node installs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(galleryop): add NodeScopedKey helpers for per-node opcache rows

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(galleryop): use strings.Cut for NodeScopedKey parsing, reject empty nodeID

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(nodes): scope DistributedBackendManager.InstallBackend to single node via TargetNodeID

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(http): make /api/nodes/:id/backends/install async via gallery service job queue

The handler previously called unloader.InstallBackend synchronously and
blocked the browser for up to 3 minutes waiting on the NATS reply. It now
enqueues a TargetNodeID-scoped ManagementOp on BackendGalleryChannel and
returns HTTP 202 + jobID immediately, matching /api/backends/install/:id.

The opcache key is built via NodeScopedKey(nodeID, backend) so concurrent
installs of the same backend across different nodes do not stomp each
other. galleryService/opcache/appConfig are threaded through
RegisterNodeAdminRoutes for this.

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(http): log malformed backend_galleries override and stop test drain goroutine

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(api): expose nodeID for node-scoped backend ops in /api/operations

Node-scoped backend installs land in opcache under "node:<nodeID>:<backend>"
keys. Without splitting that prefix back out, the operations panel renders
the full key as the display name and has no structured way to label which
worker an install is targeting. Detect the prefix, surface nodeID as its own
response field, and reduce the display name back to the bare backend slug.
Bare (non-scoped) ops are left untouched so legacy installs do not gain a
misleading empty nodeID.

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(react-ui): poll job status for node-targeted backend installs

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(react-ui): make NodeInstallPicker state updates pure and surface cancellations as errors

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(react-ui): clarify async semantics in handleInstallOnTarget

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(http): use statusUrl casing for node install response to match codebase precedent

Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 22:25:53 +02:00
LocalAI [bot]
f15b9178ec feat(usage): track and visualise usage per API key (#9920)
* feat(usage): add Source, APIKeyID, APIKeyName columns to UsageRecord

Adds three additive columns plus UsageSource* constants. The columns
are auto-migrated by InitDB. APIKeyID is a nullable foreign reference
to UserAPIKey.ID; APIKeyName is snapshotted on each row so revoked
keys keep showing their name in history.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): backfill Source on pre-feature usage rows

InitDB now classifies any pre-existing usage_record with an empty
source: 'legacy-api-key' user -> legacy, everything else -> web.
The backfill is idempotent (only touches NULL/empty rows).

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): add GetUserUsageBySource aggregator

Groups by (bucket, source, api_key_id, api_key_name). Filters out
legacy by default. Returns both per-bucket detail and roll-ups
(by_source, by_key sorted desc and capped at 200, grand_total).

The MAX(created_at) projection is iterated via Rows().Scan into a
string column and parsed manually because the SQLite driver surfaces
the aggregated timestamp as a string, which database/sql refuses to
scan directly into time.Time. Postgres returns a real timestamp; the
same string path handles its RFC3339 form too.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(usage): log Rows() errors and assert LastUsed in tests

Adds rows.Err() and Rows() open-failure logging in
computeSourceTotals so silent data drops surface in logs. Logs on
parseLastUsedString format misses for the same reason. Strengthens
the snapshot-survival test to assert LastUsed is a recent timestamp,
locking the SQLite time-string parser behaviour.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): add admin GetAllUsageBySource with filters and truncation

Optional user_id and api_key_id filters (composed with AND). Legacy
bucket is included for admin callers. truncated=true when more than
200 distinct keys would be in the by_key roll-up.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(auth): plumb auth_source and auth_apikey through Echo context

tryAuthenticate now sets auth_source on every successful branch
(web for session/Bearer-session, apikey for Bearer-key/x-api-key/
token-cookie, legacy for legacy env key match). For named-key
branches it also stores the resolved *UserAPIKey under auth_apikey
so downstream middlewares can snapshot id+name without re-validating.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(auth): expand tryAuthenticate godoc and cover Bearer-session branch

Documents all three context-keys side effects (auth_source,
auth_apikey, _auth_session) plus the split of responsibilities with
the parent Middleware. Adds a test for the Bearer-as-session-token
classification so future regressions there fail loudly.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): UsageMiddleware records source + snapshots key name

Reads auth_source and auth_apikey from the Echo context (set by
auth.Middleware in the previous task). Snapshots UserAPIKey.ID and
Name onto each row so revoked keys remain readable in history.
Falls back to source=web when no auth_source is set (auth disabled
or unrecognised path).

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): add /api/auth/usage/sources and admin variant

Self endpoint filters legacy server-side; admin endpoint includes
legacy and accepts user_id + api_key_id filters. Response includes
buckets, totals.{by_source, by_key, grand_total}, and a truncated
flag set when the per-key roll-up was capped at 200.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(routes): mark test mirror handlers as keep-in-sync with production

The newTestAuthApp helper duplicates production route handlers
inline because it cannot use RegisterAuthRoutes (which requires a
*application.Application). Naming the source path on each mirror
makes the drift contract explicit for future maintainers.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): add usageApi.getMySources/getAdminSources + i18n strings

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): add Sources tab skeleton with data fetch

Adds Usage page tab that fetches /api/auth/usage/sources (or the
admin variant). Renders raw totals plus a placeholder key list;
real visualisations land in subsequent commits. Restructures the
existing tab button block so Models and Sources are visible to
non-admins (Users remains admin-only).

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): source mix ribbon + searchable/sortable sources table

Replaces the SourcesTab placeholder rendering with two reusable
components: SourceMixRibbon (one segmented bar per source class)
and SourcesTable (search + sort + revoked-key dim). Pulls the
current API key list to detect revoked keys.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ui): skip revoked-key detection until the key list is known

existingKeyIds defaulted to an empty Set, which made every live
api_key row render as (revoked) during the brief window before
apiKeysApi.list() resolved, and permanently after a fetch failure.
Use null as the unknown state and suppress the revoked badge until
the parent provides a real Set.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(ui): top-N stacked time chart and drill-in chip for Sources tab

Top 7 sources by total tokens get distinct colours; the rest roll up
into 'Other'. Clicking a row in the SourcesTable dims everything
except that series in the chart; the chip is the canonical clear.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs(usage): document per-API-key Sources tab and endpoints

Extends features/authentication.md Usage Tracking section with:
- A 'Sources' tab description and source-class taxonomy
- Endpoint documentation for /api/auth/usage/sources and the
  admin variant
- Response shape example with by_source / by_key / grand_total
- Migration note about pre-feature row backfill

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(usage): silence errcheck on deferred rows.Close

CI errcheck flagged the bare 'defer rows.Close()' in
computeSourceTotals. Wrap in a closure that discards the close
error explicitly; an error here is non-actionable since we have
already drained the rows and logged any iteration failure.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactor(usage): bound batcher intake and add Shutdown/FlushNow hooks

The pre-existing usage batcher had no cap on its add() path; the
usageMaxPending=5000 constant only guarded the re-queue path after
a failed write, leaving memory growth unbounded if the DB fell
behind. This commit:

- Adds the cap to add() so saturation drops new records (rate-limited
  warn at 1/1024) instead of growing unbounded.
- Raises usageMaxPending to 50000 to absorb realistic inference bursts.
- Replaces the package-level batcher global with a mutex-guarded pair
  plus a currentBatcher() accessor so Init / Shutdown cycles are
  race-free.
- Adds ShutdownUsageRecorder() for graceful drain on process exit
  (not yet wired into app shutdown, just published).
- Adds FlushNow() for deterministic tests; the middleware suite no
  longer needs 6s sleeps per spec and now runs in ~50ms instead of 18s.
- Re-queue on failed flush is now cap-aware: prepends as much of the
  failed batch as fits alongside concurrent arrivals, instead of
  dropping the whole batch when full.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(usage): drain usage batcher on graceful shutdown

Registers ShutdownUsageRecorder with the existing
signals.RegisterGracefulTerminationHandler so SIGINT/SIGTERM
synchronously flushes any in-memory usage records before the
process exits. Without this, up to one flush interval (5s) of
recorded usage was lost when LocalAI restarted.

Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 16:34:02 +02:00
LocalAI [bot]
11d5bd0cc3 fix(react-ui/chat): stop wiping selection on every /api/operations poll (#9904) (#9917)
useOperations() was calling setOperations() with a fresh array on every
1s poll, even when the payload was identical. In React 19 the DOM diff
no longer short-circuits dangerouslySetInnerHTML on equal __html, so the
forced Chat re-render re-assigned innerHTML on every assistant message
once per second — wiping any text the user had selected.

Skip the state update when the serialised operations payload is
unchanged, and switch loading/error to functional setters so they also
short-circuit at the source.

Also fixes the chat copy button on plain HTTP: navigator.clipboard is
undefined in non-secure contexts (a common LXC+Docker deployment), but
the previous code called it unconditionally and showed a success toast
regardless. Routed Chat, AgentChat and CanvasPanel through a new
copyToClipboard() helper that uses navigator.clipboard when available
and falls back to a hidden-textarea + execCommand('copy') trick that
browsers still honour outside secure contexts. The fallback preserves
the user's existing selection.

Regression coverage in e2e/chat-polling-selection.spec.js: a
MutationObserver counts mutations on the assistant content node across
3s of polling (must be 0); the copy test stubs out navigator.clipboard
and asserts that execCommand('copy') is invoked.


Assisted-by: claude-opus-4-7-1m

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-21 12:17:51 +02:00
Daniel Liljeberg
fc3980dadd fix: inject text-file content into chat completions messages (#9896)
Non-image/non-audio file attachments (txt, md, csv, json) were being
  stored in the 'files' metadata field but never added to the message
  content array sent to /v1/chat/completions. Images and audio correctly
  received content blocks; files did not.

  Fix: push a text content block into messageContent when textContent is
  present, matching the pattern used for image_url and audio_url.

  Also fixes Home.jsx addFiles which never called file.text() at all,
  meaning files attached on the home screen had empty textContent even
  before reaching useChat.js.

  Note: PDF files use file.text() which returns raw bytes rather than
  parsed text. Proper PDF support would require PDF.js or server-side
  extraction and is not part of this fix.

Signed-off-by: Daniel Liljeberg <damien_@hotmail.com>
2026-05-20 01:00:32 +02:00
LocalAI [bot]
a39591f144 realtime: honor output_modalities to skip TTS in text-only mode (#9838)
* realtime: honor output_modalities to skip TTS in text-only mode

The emulated realtime pipeline previously ignored the OpenAI Realtime spec
field output_modalities and always synthesized TTS. Add resolveOutputModalities
+ modalitiesContainAudio helpers and gate the TTS / ResponseOutputAudio*
emission so a client requesting ["text"] gets only ResponseOutputText* events.

This lets thin clients (e.g. thing5-poc) cache TTS on the client side while
still using the realtime WS for VAD + STT + LLM + tool-call parsing.

Assisted-by: Claude:claude-opus-4-7

* realtime: plumb response-level output_modalities and echo on session

Follow-up to the previous commit:
- Resolve response.create's output_modalities at the gate so a per-response
  override of an audio session is honored (the test asserted this contract
  but the production call site was passing nil).
- Mirror OutputModalities in the RealtimeSession echo so session.update
  round-trips the client-supplied value, matching MaxOutputTokens's pattern.

Assisted-by: Claude:claude-opus-4-7

* realtime: silence errcheck on deferred os.Remove of TTS file

CI's errcheck flagged the pre-existing `defer os.Remove(audioFilePath)`
inside the audio-emission block (now wrapped by the modality gate). Wrap
the call in a closure that explicitly discards the error — the canonical
Go pattern for "I want to defer a cleanup whose error I genuinely don't
care about."

Assisted-by: Claude:claude-opus-4-7 golangci-lint

---------

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-05-15 12:39:47 +02:00
massy_o
745473cbe6 Validate video image URLs before download (#9819)
Signed-off-by: massy-o <telitos000@gmail.com>
2026-05-14 15:07:17 +02:00