Bumps the pip group with 1 update in the /backend/python/transformers directory: torch.
Updates `torch` from 2.7.1 to 2.7.1+xpu
---
updated-dependencies:
- dependency-name: torch
dependency-version: 2.7.1+xpu
dependency-type: direct:production
dependency-group: pip
...
Signed-off-by: dependabot[bot] <support@github.com>
* feat: forward reasoning_effort to the backend so jinja models honor it
reasoning_effort was only mapped to the binary enable_thinking toggle and
otherwise reached Go-side templates — it was never sent to the backend. So
jinja-templated models whose chat template keys on reasoning_effort (gpt-oss
Harmony, LFM2.5) could not be driven by it: LFM2.5 ignores enable_thinking and
kept emitting <think>.
Forward the effective reasoning_effort to the backend as a chat_template_kwarg
(mirroring enable_thinking) in grpc-server.cpp, and put it in PredictOptions
metadata (gRPCPredictOpts). Add a config-level default: ModelConfig.reasoning_effort
and Pipeline.reasoning_effort, resolved by ModelConfig.ApplyReasoningEffort:
the request value overrides the config default, "none" disables thinking while
any other level enables it, and an operator's reasoning.disable always wins.
request.go now uses that helper.
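
The precedence rules above can be sketched roughly as follows. This is an illustration only: resolveReasoningEffort, its parameters, and its three return values are hypothetical and do not reproduce the real ModelConfig.ApplyReasoningEffort signature.

```go
package main

import "fmt"

// resolveReasoningEffort is a hypothetical helper illustrating the precedence
// rules; it is NOT the real ModelConfig.ApplyReasoningEffort.
// It returns the effective effort, whether thinking should be enabled, and
// whether any decision was made at all (false when nothing was configured).
func resolveReasoningEffort(requestEffort, configDefault string, operatorDisabled bool) (effort string, thinking, decided bool) {
	effort = configDefault
	if requestEffort != "" {
		effort = requestEffort // the request value overrides the config default
	}
	switch {
	case operatorDisabled:
		return effort, false, true // an operator's reasoning.disable wins
	case effort == "":
		return "", false, false // nothing set: leave the model's default alone (assumption)
	case effort == "none":
		return effort, false, true // none -> disable
	default:
		return effort, true, true // any level -> enable
	}
}

func main() {
	effort, thinking, decided := resolveReasoningEffort("", "none", false)
	fmt.Println(effort, thinking, decided)
}
```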
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(realtime): set the pipeline LLM's reasoning_effort
Apply Pipeline.ReasoningEffort to the pipeline's LLM config when the realtime
model is built (per-session copy, overrides the LLM's own reasoning_effort),
and surface the resolved effort on the template input so Go-templated models
get it too. jinja models receive it via the backend metadata. This lets a
realtime pipeline disable thinking on models that only honor reasoning_effort
(e.g. LFM2.5), which enable_thinking can't.
Assisted-by: Claude:claude-opus-4-8 go test, golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The worker HTTP file-transfer server is authenticated by the registration
token via checkBearerToken, which fails open on an empty token: every
/v1/files, /v1/files-list and /v1/backend-logs request is then served
unauthenticated, granting read/write to the worker's models/staging/data
directories. The fail-open was also silent (the only auth log sat on the
unreachable reject branch), and the worker process never runs
DistributedConfig.Validate(), so the existing frontend warning did not
cover the component that exposes the server.
Mirror the NatsRequireAuth pattern: keep anonymous as the default but make
it loud and opt-in enforceable.
- Log a prominent warning when the file-transfer server starts tokenless.
- Add LOCALAI_REGISTRATION_REQUIRE_AUTH: DistributedConfig.Validate() errors
on an empty token (frontend) and the worker refuses to start (fail-fast,
before registration), so production can fail closed. Also satisfies the
F-003 suggestion to fail Validate() on distributed + empty token.
- Add LOCALAI_DISTRIBUTED_REQUIRE_AUTH umbrella switch implying both
RegistrationRequireAuth and NatsRequireAuth — one production knob locking
down the registration/file-transfer layer and the NATS bus together; the
granular flags remain available as single-layer overrides. Wired into the
frontend, supervisor worker, and agent worker (vLLM worker has neither a
NATS connection nor a file-transfer server, so it is left untouched).
- Document in distributed-mode.md (warning callout + flag tables).
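
A minimal sketch of how the umbrella switch could imply the two granular flags. The authConfig type, its field layout, and the validate method are assumptions made for illustration; only the flag semantics come from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// authConfig is a hypothetical, trimmed-down view of the relevant settings.
type authConfig struct {
	DistributedRequireAuth  bool // umbrella: LOCALAI_DISTRIBUTED_REQUIRE_AUTH
	RegistrationRequireAuth bool // granular: LOCALAI_REGISTRATION_REQUIRE_AUTH
	NatsRequireAuth         bool // granular NATS flag
	RegistrationToken       string
}

// requireRegistrationAuth is true when either the umbrella or the granular flag is set.
func (c authConfig) requireRegistrationAuth() bool {
	return c.DistributedRequireAuth || c.RegistrationRequireAuth
}

// requireNatsAuth is true when either the umbrella or the granular flag is set.
func (c authConfig) requireNatsAuth() bool {
	return c.DistributedRequireAuth || c.NatsRequireAuth
}

// validate fails closed on an empty token only when enforcement was requested;
// anonymous mode stays the default.
func (c authConfig) validate() error {
	if c.requireRegistrationAuth() && c.RegistrationToken == "" {
		return errors.New("registration auth is required but the registration token is empty")
	}
	return nil
}

func main() {
	fmt.Println(authConfig{}.validate())
	fmt.Println(authConfig{DistributedRequireAuth: true}.validate())
}
```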
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(distributed): self-heal stale 'model not loaded' routing
In distributed mode the registry can list a model as loaded on a node
while the worker has evicted it (autonomous LRU eviction, an out-of-band
unload, etc.) yet the backend process survives. The router's cached-node
check only verifies the process is alive (probeHealth), so it routes there
and inference fails with "<backend>: model not loaded" — and stays broken
until the controller restarts and rebuilds its registry.
InFlightTrackingClient now reconciles this: when a tracked inference call
returns a model-not-loaded error, it drops the stale replica row
(RemoveNodeModel) so the next request reloads the model on a healthy node
instead of routing back to the evicted one. The original error is returned
unchanged; only the registry is corrected.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(distributed): typed model-not-loaded error via gRPC status code
Replace the controller-side error-string match with a shared, code-aware
helper. Go error types don't survive the gRPC boundary, so the signal is
carried as a status code (FailedPrecondition):
- pkg/grpc/grpcerrors: ModelNotLoaded(backend) constructor +
IsModelNotLoaded(err) checker (status-code first, message fallback for
backends not yet migrated).
- InFlightTrackingClient.reconcile now uses grpcerrors.IsModelNotLoaded.
- Migrate the Go backends that emit this error (parakeet-cpp, cloud-proxy,
rfdetr-cpp) to the typed constructor.
Acting on a false positive is harmless (the model is just reloaded).
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
FLAG_FACE_RECOGNITION and FLAG_SPEAKER_RECOGNITION already existed as
ModelConfigUsecase bitmask flags, and GuessUsecases already gate-checks
both backends by name — but BackendCapabilities had no entries for
either, so the UI could not classify them.
Also missing were the Method* constants for the five proto-defined RPCs
these backends implement (FaceVerify, FaceAnalyze, VoiceVerify,
VoiceEmbed, VoiceAnalyze) and the corresponding Usecase* strings
and UsecaseInfoMap entries needed to wire them into the rest of the
capability system.
Changes:
- Add MethodFaceVerify, MethodFaceAnalyze, MethodVoiceVerify,
MethodVoiceEmbed, MethodVoiceAnalyze GRPCMethod constants
- Add UsecaseFaceRecognition ("face_recognition") and
UsecaseSpeakerRecognition ("speaker_recognition") Usecase constants
- Add UsecaseInfoMap entries for both new usecases, referencing the
existing FLAG_FACE_RECOGNITION and FLAG_SPEAKER_RECOGNITION flags
- Register insightface: Embedding + Detect + FaceVerify + FaceAnalyze
- Register speaker-recognition: VoiceVerify + VoiceEmbed + VoiceAnalyze
Follows up on #10107, which left these two out because they needed new
constants first.
Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Adira Denis Muhando <dennisadira@gmail.com>
Distributed file-staging treated every model path field (ModelFile, etc.)
as a single regular file: it os.Open'd the path and streamed its fd as the
HTTP PUT body. For directory-based models — e.g. qwen3-tts-cpp, whose
weights and tokenizer ggufs live under one directory referenced by
parameters.model — opening the directory succeeds but reading its fd
returns EISDIR, so routing the model to a remote NATS worker failed with
"read /models/<model>: is a directory". Single-file models were unaffected,
so only multi-file pipelines (e.g. the realtime TTS stage) broke.
stageModelFiles now detects a directory path field and stages each
contained file individually (via the new stageDirectory helper), preserving
structure with the existing StagingKeyMapper and rewriting the field to the
remote directory (deriving ModelPath as before). countStageableFiles makes
the progress total count a directory's files so the staging tracker stays
accurate.
Assisted-by: Claude:claude-opus-4-8 go vet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The qwen3-tts.cpp backend honored the request `language` field only via exact lowercase two-letter codes in the C++ language_to_id table, silently defaulting to English for anything else (en-US, EN, english, ...).
Add normalizeLanguage() in the Go handler: lowercase + trim, strip the region/locale suffix (en-US, pt_BR, zh-Hans -> en/pt/zh), and resolve common English full names (english -> en). The canonical codes match the existing C++ table, so no C++ change is needed. Covered by a pure-Go Ginkgo spec. Also document the language field and accepted forms under the Qwen3-TTS docs.
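
A rough sketch of the normalization described above, not the handler's actual code. The languageNames table is an illustrative assumption: only the english -> en mapping is taken from this message.

```go
package main

import (
	"fmt"
	"strings"
)

// languageNames maps full names to canonical codes. Only "english" is
// confirmed by the commit message; "german" is an illustrative assumption.
var languageNames = map[string]string{
	"english": "en",
	"german":  "de",
}

// normalizeLanguage lowercases and trims the input, resolves known full
// names, and strips a region/locale suffix such as -US, _BR or -Hans.
func normalizeLanguage(lang string) string {
	l := strings.ToLower(strings.TrimSpace(lang))
	if code, ok := languageNames[l]; ok {
		return code
	}
	if i := strings.IndexAny(l, "-_"); i > 0 {
		l = l[:i]
	}
	return l
}

func main() {
	for _, in := range []string{"en-US", "pt_BR", "zh-Hans", " EN ", "english"} {
		fmt.Printf("%q -> %q\n", in, normalizeLanguage(in))
	}
}
```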
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The OpenAI-compatible TTS endpoint accepts an `instructions` field, but it
was silently dropped at the HTTP->gRPC boundary: neither schema.TTSRequest
nor the gRPC TTSRequest proto carried it, so backends could only read such a
value from static YAML options (identical for every request). This blocked
per-line emotion/style and, for Qwen3-TTS VoiceDesign, limited a model config
to a single designed voice.
Plumb a generic per-request instruction string end to end, plus an optional
backend-specific params map:
- proto: add `optional string instructions` and `map<string,string> params`
to TTSRequest.
- schema: add Instructions (maps OpenAI `instructions`) and Params (LocalAI
extension) to schema.TTSRequest.
- core: thread both through ModelTTS/ModelTTSStream via a newTTSRequest helper
that attaches instructions only when non-empty (so backends can fall back to
YAML when unset); forward them from the /v1/audio/speech handler.
- qwen-tts: prefer the per-request instruction over the YAML `instruct` option
(used by both mode detection and generation) and merge per-request params.
- chatterbox: merge per-request params (coerced to float/int/bool) over YAML
options into generate() kwargs.
Fully backward compatible: empty instructions fall back to the YAML option and
backends that don't support style/voice instructions ignore the field.
Closes #10164
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): NATS JWT auth, TLS/mTLS options, and e2e coverage
Mint per-node NATS user JWTs at registration when LOCALAI_NATS_ACCOUNT_SEED
is set, and connect workers with scoped credentials from the register response.
Add optional LOCALAI_NATS_TLS_CA/CERT/KEY for private CA and mTLS alongside
tls:// URLs, plus test-e2e-distributed and NatsJWT container e2e specs.
Document JWT setup (nats-auth-setup.sh) and TLS env vars in distributed-mode.
Assisted-by: Grok:grok grok-build
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(distributed): correct NATS JWT scoping and harden client auth
The JWT-auth path added in 46467cc7 had several gaps that fail silently
under LOCALAI_NATS_REQUIRE_AUTH:
- Agent-worker minted JWTs did not allow the subjects the agent worker
actually subscribes to (jobs.mcp-ci.new and nodes.<id>.backend.stop),
so MCP-CI jobs and backend-stop session cleanup were silently dropped.
Scope the agent permission set to those subjects.
- NATS subscription permission violations were swallowed (Subscribe
returned a live-but-dead subscription). Confirm subscriptions with a
server round-trip so a denial surfaces synchronously, and log async
permission errors.
- The backend worker connected anonymously when given a JWT without its
paired seed; reject the unpaired credential instead.
- The documented service-user permissions in nats-auth-setup.sh omitted
prefixcache.>, which the frontend publishes and subscribes; add it.
Also: add a credential-provider hook to the messaging client (consumed by
the follow-up credential-lifecycle change), drop the always-nil error from
NatsMessagingOptions, run go mod tidy (jwt/v2 and nkeys are now direct),
and gofmt the feature's files.
Tests: an agent-JWT e2e spec that connects to the enforcing NATS server
and exercises every subscription the agent worker makes, plus permission
allow-list coverage unit tests.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(distributed): acquire and auto-refresh worker NATS credentials
Workers fetched NATS credentials once at startup, which broke two cases
under JWT auth: a worker that registered while still pending admin
approval never received a minted JWT (it connected unauthenticated and
gave up), and a long-running worker's 24h JWT expired with no way to renew
it.
Introduce workerregistry.NATSCredentialManager, built on idempotent
re-registration (the frontend preserves the node row and mints a fresh JWT
each call):
- Acquire re-registers through admin approval until the node is approved
and credentials are minted (or returns the first success when auth is
not required, preserving anonymous-NATS behavior).
- RefreshLoop re-registers before the JWT expires (~75% of its lifetime),
updating the credentials served to the connection.
- Both are bounded (default 100 attempts / consecutive failures) and
return an error on exhaustion, so an unapprovable or unrenewable worker
exits non-zero and surfaces the problem instead of hanging or drifting
toward an expired credential.
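
The refresh timing can be illustrated with a small calculation. This is a sketch under the stated ~75% rule; refreshDelay and its parameters are hypothetical and not part of NATSCredentialManager.

```go
package main

import (
	"fmt"
	"time"
)

// refreshDelay returns how long to wait before re-registering, given the
// JWT's total lifetime and how much of it has already elapsed. The refresh
// point is placed at 75% of the lifetime; a JWT already past that point is
// refreshed immediately.
func refreshDelay(lifetime, elapsed time.Duration) time.Duration {
	refreshAt := lifetime * 3 / 4
	if elapsed >= refreshAt {
		return 0
	}
	return refreshAt - elapsed
}

func main() {
	// A fresh 24h JWT is refreshed after 18h.
	fmt.Println(refreshDelay(24*time.Hour, 0))
	// A worker that restarts 20h into the lifetime refreshes right away.
	fmt.Println(refreshDelay(24*time.Hour, 20*time.Hour))
}
```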
The messaging client gains WithUserJWTProvider, fetching credentials on
each (re)connect so the connection transparently adopts a refreshed JWT
when the server expires the old one. RegisterFull exposes the approval
status and full response; Register delegates to it.
Both the backend worker and the agent worker are wired to this: explicit
env credentials are used as-is, minted credentials are acquired-with-wait
and refreshed, and a permanent refresh failure shuts the worker down so it
restarts and re-acquires.
Tests cover Acquire (wait-through-pending, bounded give-up, context
cancel), RefreshLoop (refresh-before-expiry, bounded failure, no-expiry
exit) and jwtExpiry decoding. Docs updated in distributed-mode.md.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The direct (non-batched) transcription path handed the original upload
path straight to the C library via parakeet_capi_transcribe_path_json.
That loader only understands 16 kHz mono WAV/PCM, so any other format
(MP3, etc.) failed with "parakeet: failed to load audio: <file>".
Only the batched path converted the input (via decodeWavMono16k ->
utils.AudioToWav). Every other audio backend (whisper, crispasr)
converts unconditionally with utils.AudioToWav before handing the file
to its engine; the parakeet-cpp fallback was the lone exception.
Extract a convertToWavMono16k helper (reused by decodeWavMono16k) that
produces a 16 kHz mono WAV in a temp dir, and run the non-batched path
through it before calling the C loader. WAV inputs already in the target
format are passed through without ffmpeg.
Add specs covering the helper (decodable copy + cleanup, and an error on
a missing input) that need neither the model, the C library, nor ffmpeg.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
docs: fix distributed-mode diagram - workers coordinate via NATS, not PostgreSQL
The architecture diagram drew the worker-bound arrows from the PostgreSQL area of the control plane, implying workers connect to PostgreSQL. They do not: PostgreSQL is the frontends' shared state, while workers coordinate over NATS (backend.install events) and receive LoadModel over gRPC from a frontend. Re-route the worker arrows to originate from the NATS chip.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* docs: add 'how LocalAI works' architecture diagram
Add a blueprint-style architecture diagram: clients -> small core (API,
router, WebUI, agents) -> gRPC -> backend processes pulled on demand as
OCI images. Place it on the overview page and replace the stale external
architecture image on the reference page.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs: add blueprint diagrams across feature, distributed & getting-started docs
Add 24 architecture/flow/comparison diagrams (PNG + HTML source) under
docs/static/images/diagrams/, wired into their docs pages, from an
impact-vs-effort audit of the docs. Broaden the API surface on the
overview architecture diagram (OpenAI, Anthropic, ElevenLabs, Ollama,
and LocalAI's own API) and move the gRPC boundary label clear of the arrows.
Pages: distributed mode (architecture, scheduling, ds4 layer-split),
distributed inferencing, MLX, realtime, quantization, MCP, agents,
mitm & cloud proxy, middleware, reverse-proxy TLS, VRAM, voice & face
recognition, reranker, function calling, fine-tuning (recipe + jobs),
diarization, audio transform, quickstart, model resolution.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs: add composable-core diagram to README hero
Commit the composable-core card (small core + on-demand backend tiles)
alongside the other diagrams and reference it from the README hero via a
repo-relative path, so it renders on GitHub.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs: fix composable-core connectors/badge and federated-vs-worker layout
- composable-core: thicken the plug-in connectors so they read clearly, and
widen the SEPARATE IMAGE badge so its text no longer overflows the box.
- federated-vs-worker: shorten the WHOLE/SPLIT REQUEST pills to fit, and
replace the tangled node-to-node activation arrows with a clean fan-out
(request split across all sharded nodes), mirroring the federated panel.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Reframe the README hero and docs (homepage, overview, FAQ) around the
composable architecture: a small core, with backends built as dedicated
gRPC services around best-in-class engines, shipped as separate OCI
images and pulled on demand. Lead from strength: drop the "36+ backends"
kitchen-sink framing and the "All-in-One Complete AI Stack" / "single
binary that gives you everything" lines that read as a monolith.
- README: small-core differentiator; composable + open/extensible bullets
- _index.md: composable tagline; install only what you use
- overview.md: core vs on-demand backends; gRPC/OCI mechanics as benefits;
bring-your-own model and backend
- faq.md: "Do I need to install all the backends?" and
"Can I bring my own model or backend?"
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The split_mode: tensor description claimed tensor parallelism requires
KV-cache quantization to be disabled. ggml-org/llama.cpp#23792 lifts that
restriction by extending the meta backend to preserve shape information
through KV-cache flatten/reshape, so cache_type_k/cache_type_v
quantization can be combined with -sm tensor on builds that include it.
Documentation only: no backend code, grpc-server.cpp comment, or
llama.cpp pin changes.
Assisted-by: Claude Code:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(parakeet-cpp): dynamic-batching scheduler (queue + dispatcher)
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(parakeet-cpp): dynamic batching for AudioTranscription via batched JSON C-API
Drop SingleThread; route unary transcription through the in-process batcher
which coalesces concurrent requests into one batched engine call. Streaming
stays mutually exclusive via engineMu. Adds batch_max_size / batch_max_wait_ms
options (size=1 disables; recommended on CPU).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(parakeet-cpp): tear down dispatcher in Free; log batch config; preallocate; clarify stream lock
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(parakeet-cpp): Ginkgo batcher tests; optional batch C-API binding with per-request fallback
The batched JSON C-API symbol exists only in newer libparakeet.so (ABI >= 2);
probe it with Dlsym and register optionally so the backend still loads against
an older library, falling back to per-request transcription. Rewrites the
batcher unit tests as Ginkgo/Gomega specs (forbidigo bans t.Fatal in tests).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(parakeet-cpp): debug-log coalesced batch size in runBatch
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(parakeet-cpp): default batch_max_size to 1 (batching opt-in)
Dynamic batching now defaults off (batch_max_size:1, one request at a
time). Raise batch_max_size to opt in: it is a large throughput win on
GPU under concurrent load, but on CPU and low-concurrency setups it only
adds latency, so off is the safer default. The startup log now states
whether batching is on or off, and the audio-to-text docs are updated to
match.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* chore(parakeet-cpp): bump parakeet.cpp to 8a7c482 (batched decode + B=1 fast-path)
parakeet.cpp PR #1 merged the batched encoder/decode and the B=1 encoder
fast-path to master. Point PARAKEET_VERSION at that commit so the backend
builds the batched C-API (parakeet_capi_transcribe_pcm_batch_json) that the
dynamic batcher calls; the prior pin (30a3075) predated it, so only the
per-request fallback path was exercised. Verified the shared lib builds with
the backend's CMake flags and exports the batch symbol.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Pin texterrors==1.1.6 before nemo_toolkit[asr] in requirements-cublas13.txt.
The texterrors package (a NeMo transitive dependency) contains a compiled
C++ extension (texterrors_align.so) that may be built from source during
OCI image creation. When built on systems with GCC 14+ (e.g. Ubuntu 24.04),
the resulting binary requires GLIBCXX_3.4.32, which is not available in
the default LocalAI container (Ubuntu 22.04, GLIBCXX up to 3.4.30).
Pinning to 1.1.6 (the latest release) ensures:
- Reproducible builds across environments
- pip resolves the pre-built manylinux2014 wheel (needs only GLIBCXX_3.4.11)
instead of potentially building from source with a newer toolchain
Fixes #10056
Signed-off-by: 番茄摔成番茄酱 <fqscfqj@outlook.com>
The UI coverage gate was tightened to 0.1pp against a fast-local
measurement (39.86% baseline); CI's slower runners measure ~0.9pp lower,
so tests-ui-e2e failed there. UI e2e coverage is diffusely
non-deterministic and tracks machine speed — a 0.1pp band can't hold
across environments.
Rather than loosen the gate, raise the floor under it: a render-smoke
spec mounts each lazy page (navigate + assert the header renders),
covering a dozen previously-untested pages and lifting coverage from
~39% to ~42.7% locally. Restore the tolerance to 0.8pp and set the
baseline conservatively (40.0), below the slow-CI floor, so the ratchet
holds without flapping.
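
The arithmetic behind the two settings, as a sketch. It assumes the gate passes when measured coverage is at least baseline minus tolerance; coverageGatePasses is a hypothetical name, and the real gate script may differ.

```go
package main

import "fmt"

// coverageGatePasses models the ratchet as: measured >= baseline - tolerance.
// This is an assumed formulation of the gate, not the actual script.
func coverageGatePasses(measured, baseline, tolerance float64) bool {
	return measured >= baseline-tolerance
}

func main() {
	// Old settings: a CI run ~0.9pp under the 39.86 baseline fails a 0.1pp band.
	fmt.Println(coverageGatePasses(38.96, 39.86, 0.1))
	// New settings: slow CI at roughly 41.8% clears baseline 40.0 with 0.8pp tolerance.
	fmt.Println(coverageGatePasses(41.8, 40.0, 0.8))
}
```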
Document the coverage policy — install the git hooks and don't bypass
them (no --no-verify, no hand-lowering the baseline or widening the
tolerance); raise coverage by adding tests instead; set the UI baseline
below the slow-CI floor — in AGENTS.md, CONTRIBUTING.md and
.agents/building-and-testing.md.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Move ReplicaCandidate and PickBestReplica out of core/services/nodes (which depends on gorm) into a new dependency-light leaf package pkg/clusterrouting, so the p2p federation server can later share the same replica-selection policy without pulling in a database driver.
core/services/nodes keeps a type alias and a thin delegator, so every existing reference (the LoadedReplicaStats interface method, the ReplicaCandidate row conversion in registry.go, and the SQL policy-mirror test) compiles and behaves unchanged. This is a pure, behavior-preserving refactor: the full nodes suite, including the policy-mirror spec that pins the SQL ORDER BY to PickBestReplica, stays green.
Assisted-by: Claude Code:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* chore(localvqe): update backend to v1.3, add v1.2/v1.3 gallery models
Bump the LocalVQE backend pin 72bfb4c6 -> b0f0378a, which adds the v1.2
(1.3 M) and v1.3 (4.8 M) GGUF SHA-256s to the upstream released-models
allowlist (and the arch_version=3 loader) so both load without
LOCALVQE_ALLOW_UNHASHED.
Add gallery entries for localvqe-v1.2-1.3m and localvqe-v1.3-4.8m
(SHA-256 verified against the downloaded weights) and update the
audio-transform docs to make v1.3 the current default while noting the
compact v1.1/v1.2 alternatives.
Assisted-by: Claude:claude-opus-4-8 Claude-Code
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* chore(flake): add ffmpeg-headless to the dev shell
pkg/utils/ffmpeg_test.go shells out to the `ffmpeg` CLI, and the
pre-commit gate runs those tests via `make test-coverage`. Without
ffmpeg in the dev shell the gate fails with "executable file not found
in $PATH". The headless build provides the CLI without GUI/X deps.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(localvqe): parse WAV by walking RIFF sub-chunks
Walk the RIFF chunk list instead of assuming the canonical 44-byte
header layout. Real inputs (browser-recorded clips, ffmpeg output with
an 18/40-byte extensible `fmt ` chunk or trailing LIST/INFO metadata)
would otherwise splice header/metadata bytes into the PCM stream as an
audible impulse. Honour the `data` chunk size and validate that both
`fmt ` and `data` chunks are present.
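
A minimal sketch of the chunk walk in Go, for illustration only; it is not the backend's parser. The function names are hypothetical, and clamping a truncated final chunk is an assumption of this sketch.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// chunk builds one RIFF chunk: 4-byte ID, little-endian size, payload, pad byte.
func chunk(id string, payload []byte) []byte {
	out := append([]byte(id), 0, 0, 0, 0)
	binary.LittleEndian.PutUint32(out[4:8], uint32(len(payload)))
	out = append(out, payload...)
	if len(payload)%2 == 1 {
		out = append(out, 0) // chunks are padded to an even length
	}
	return out
}

// findWAVChunks walks the chunk list after the 12-byte RIFF/WAVE header and
// returns the `fmt ` and `data` payloads, skipping every other chunk.
func findWAVChunks(file []byte) (format, pcm []byte, err error) {
	if len(file) < 12 || string(file[0:4]) != "RIFF" || string(file[8:12]) != "WAVE" {
		return nil, nil, errors.New("not a RIFF/WAVE file")
	}
	haveFmt, haveData := false, false
	for off := 12; off+8 <= len(file); {
		id := string(file[off : off+4])
		size := int(binary.LittleEndian.Uint32(file[off+4 : off+8]))
		body := off + 8
		if size < 0 || size > len(file)-body {
			size = len(file) - body // clamp a truncated final chunk (assumption)
		}
		switch id {
		case "fmt ":
			format, haveFmt = file[body:body+size], true
		case "data":
			pcm, haveData = file[body:body+size], true // honour the declared size
		}
		off = body + size + size%2 // skip the payload and its pad byte
	}
	if !haveFmt || !haveData {
		return nil, nil, errors.New("missing fmt or data chunk")
	}
	return format, pcm, nil
}

func main() {
	body := []byte("WAVE")
	body = append(body, chunk("fmt ", make([]byte, 18))...) // 18-byte fmt chunk
	body = append(body, chunk("data", []byte{1, 2, 3, 4})...)
	body = append(body, chunk("LIST", []byte("INFO"))...) // trailing metadata
	format, pcm, err := findWAVChunks(chunk("RIFF", body))
	fmt.Println(len(format), pcm, err)
}
```

Because the PCM slice is cut to the declared `data` size, trailing LIST/INFO bytes never reach the sample stream.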
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(security-headers): allow blob: in connect-src for waveform fetch
The waveform renderer XHRs/fetches a freshly-created blob: object URL
(e.g. an uploaded or enhanced clip before it has a server URL). XHR/fetch
of blob: is governed by connect-src, not media-src, so it was blocked by
the CSP. Add blob: to connect-src.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(react-ui): add input/output spectrogram view to AudioTransform
The transform page only showed time-domain amplitude waveforms, so you
could see how loud a clip was but not which frequencies the model
touched. Add a time x frequency spectrogram heatmap and render the input
and output spectrograms side by side, so it's visible which bands the
enhancement attenuates (bright input bands that go dark in the output).
Computed client-side via a Hann-windowed STFT over both clips (a small
dependency-free radix-2 FFT), defaulting to the LocalVQE 512/256 frame
geometry. This shows the net input->output spectral change; the model's
internal gain mask is not exposed by the backend.
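
For reference, the framing arithmetic behind a Hann-windowed STFT, sketched in Go rather than the UI's JavaScript. It assumes "512/256" means a 512-sample frame with a 256-sample hop and uses the periodic Hann form; both are assumptions, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// hannWindow returns a periodic Hann window: w[i] = 0.5 * (1 - cos(2*pi*i/n)).
func hannWindow(n int) []float64 {
	w := make([]float64, n)
	for i := range w {
		w[i] = 0.5 * (1 - math.Cos(2*math.Pi*float64(i)/float64(n)))
	}
	return w
}

// frameCount returns how many full frames fit when stepping by hop samples.
func frameCount(samples, frameSize, hop int) int {
	if samples < frameSize {
		return 0
	}
	return 1 + (samples-frameSize)/hop
}

func main() {
	w := hannWindow(512)
	fmt.Println(w[0], w[256])
	// One second of 16 kHz audio with the assumed 512/256 geometry.
	fmt.Println(frameCount(16000, 512, 256))
}
```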
- src/utils/fft.js radix-2 FFT
- src/hooks/useSpectrogram.js decode + STFT -> normalised dB magnitude grid
- src/components/audio/Spectrogram.jsx canvas heatmap (magma colormap)
- AudioTransform.jsx dual-spectrogram panel + CSS
- e2e spec + UI coverage baseline bump (38.29 -> 39.0; measured ~39.4-40.2)
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* test(react-ui): make UI coverage deterministic, tighten the gate
UI e2e line coverage swung ~1pp run-to-run (39.1% <-> 40.2%), which forced
a loose 0.8pp tolerance on the monotonic gate — a band wide enough to let
a real ~300-line regression through silently. The swing was a bug, not
inherent jitter: the 'Create Agent navigates' spec ended on the URL
assertion, so AgentCreate.jsx's ~400 lines were collected only when its
render happened to beat the coverage teardown.
Wait for the page to actually render (assert its heading) so those lines
are covered every run. With the race gone, repeated runs land within
~0.013pp of each other, so:
- tighten UI_COVERAGE_TOLERANCE 0.8 -> 0.1 (noise floor, not a drift band)
- set the baseline to the real, reliably-achieved value (39.0 -> 39.86)
Localised by running the V8-coverage suite repeatedly and diffing per-file
line coverage; AgentCreate.jsx was the sole ~1pp flipper.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
fix(parakeet-cpp): forward PARAKEET_GGML_* so cublas/hipblas/vulkan builds aren't silently CPU-only
parakeet.cpp gates its GGML backends behind PARAKEET_GGML_CUDA/HIP/VULKAN and
does set(GGML_CUDA ${PARAKEET_GGML_CUDA} CACHE BOOL "" FORCE), which overwrites
a bare -DGGML_CUDA=ON back to OFF. So the backend's BUILD_TYPE=cublas (and hipblas,
vulkan) produced a CPU-only libparakeet.so. Forward the PARAKEET_GGML_* options
instead. Verified on a GB10 (CUDA 13): the lib now links libcudart/libcublas and
registers the CUDA backend, vs a CPU-only lib before.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Large model GGUFs (multi-GB) transferred between master and worker over
flaky / bandwidth-throttled paths (e.g. libp2p relays with byte caps) used
to restart from byte 0 on every transport error. This change adds standard
HTTP Range/resume semantics to the worker's PUT /v1/files/<key> endpoint
and teaches the master-side HTTPFileStager to consult the worker for the
last accepted offset and resume from there.
Server side (file_transfer_server.go):
- PUT now honors Content-Range: bytes <start>-<end>/<total>. The handler
validates that <start> matches the current on-disk size; mismatches
return 416 with the actual size in X-File-Size.
- Mid-upload chunks return 308 Permanent Redirect ("Resume Incomplete")
with the new size, so the client can keep going.
- An optional X-Content-SHA256 request header binds an upload to a target
hash; cross-attempt drift returns 409. On the final chunk the server
re-computes SHA-256 and returns 400 if it doesn't match.
- HEAD now advertises Accept-Ranges: bytes and Content-Length, and exposes
X-Target-SHA256 for in-progress files (so clients can resume only when
the partial bytes belong to the file they want to upload).
- Legacy PUTs with no Content-Range keep the original truncate-create
semantics — zero behavior change on the happy path.
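
A sketch of the offset check on the server side, assuming the header shape given above. parseContentRange and chunkStatus are hypothetical helpers, hash verification is left out, and the 200 returned for the final chunk is an assumption (the message does not state the success code).

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

// parseContentRange parses "bytes <start>-<end>/<total>".
func parseContentRange(h string) (start, end, total int64, err error) {
	if !strings.HasPrefix(h, "bytes ") {
		return 0, 0, 0, errors.New("content-range: expected the bytes unit")
	}
	span, totalStr, ok := strings.Cut(strings.TrimPrefix(h, "bytes "), "/")
	if !ok {
		return 0, 0, 0, errors.New("content-range: missing total")
	}
	startStr, endStr, ok := strings.Cut(span, "-")
	if !ok {
		return 0, 0, 0, errors.New("content-range: missing end offset")
	}
	if start, err = strconv.ParseInt(startStr, 10, 64); err != nil {
		return 0, 0, 0, err
	}
	if end, err = strconv.ParseInt(endStr, 10, 64); err != nil {
		return 0, 0, 0, err
	}
	if total, err = strconv.ParseInt(totalStr, 10, 64); err != nil {
		return 0, 0, 0, err
	}
	if start < 0 || end < start || end >= total {
		return 0, 0, 0, errors.New("content-range: inconsistent offsets")
	}
	return start, end, total, nil
}

// chunkStatus picks the response status for a chunk given the current on-disk size.
func chunkStatus(start, end, total, onDisk int64) int {
	switch {
	case start != onDisk:
		return http.StatusRequestedRangeNotSatisfiable // 416, client must re-probe
	case end+1 < total:
		return http.StatusPermanentRedirect // 308, more chunks expected
	default:
		return http.StatusOK // final chunk; success code is an assumption
	}
}

func main() {
	start, end, total, err := parseContentRange("bytes 0-99/200")
	fmt.Println(start, end, total, err, chunkStatus(start, end, total, 0))
}
```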
Client side (file_stager_http.go):
- Pre-PUT HEAD probe reads X-File-Size + X-Target-SHA256 to determine the
resume offset.
- doUpload seeks to that offset and sends Content-Range + X-Content-SHA256.
- Retry loop switches from fixed 3 attempts / 5s-10s-20s backoff to an outer
time budget with exponential backoff (1s -> 30s cap), so a 5GB upload over a
flaky link can outlast many short disconnects.
- 308 and 416 responses are treated as transient: the next iteration
re-HEADs to learn the correct offset.
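
The retry schedule might look like the sketch below. It assumes the delay doubles per attempt between the 1s floor and 30s cap; backoffDelay and delaysWithin are hypothetical names, and the budget used in main is an arbitrary example value, not the stager's real setting.

```go
package main

import (
	"fmt"
	"time"
)

const (
	minDelay = time.Second
	maxDelay = 30 * time.Second
)

// backoffDelay returns the wait before retry number attempt (0-based),
// doubling from minDelay and capping at maxDelay.
func backoffDelay(attempt int) time.Duration {
	d := minDelay
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// delaysWithin lists the successive waits that fit inside an overall budget,
// ignoring the time spent in the upload attempts themselves.
func delaysWithin(budget time.Duration) []time.Duration {
	var out []time.Duration
	var spent time.Duration
	for attempt := 0; ; attempt++ {
		d := backoffDelay(attempt)
		if spent+d > budget {
			return out
		}
		spent += d
		out = append(out, d)
	}
}

func main() {
	fmt.Println(delaysWithin(2 * time.Minute)) // example budget, not the real one
}
```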
Tests:
- Two-chunk Content-Range round-trip produces the correct file + sidecar.
- 416 on a Content-Range/file-size mismatch.
- 409 on X-Content-SHA256 drift between chunks.
- 400 on final-hash mismatch.
- HEAD on a partial upload exposes X-Target-SHA256 (not a misleading
hash-of-partial-bytes via X-Content-SHA256).
- Pre-existing finished file with a different hash is transparently
overwritten when a new PUT starts at byte 0.
- End-to-end resume: EnsureRemote against a worker that already holds a
partial file transfers only the remainder.
- Mid-stream connection drop on attempt #1 is recovered by attempt #2
resuming from the partial offset.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* ⬆️ Update ggml-org/llama.cpp
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix(llama-cpp): skip begin-of-stream null partial in PredictStream
Upstream llama.cpp (ggml-org/llama.cpp#23884), pulled in by this bump,
now emits an initial "begin" partial whose to_json() returns null. It
exists only to signal the HTTP layer to flush 200 status headers before
any token is produced.
gRPC has no such concept, and PredictStream had no guard: the null result
was fed straight into build_reply_from_json, which threw an uncaught
exception. That surfaced as a generic "Unexpected error in RPC handling"
and the task was cancelled the instant it launched, breaking the
PredictStream e2e spec.
Skip null results in both the first-result handling and the streaming
loop, mirroring upstream's own `if (first_result_json == nullptr)` guard.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
In LocalAI distributed mode the master streams a model GGUF to a
worker on first inference. On bandwidth-constrained cluster networks
(libp2p circuit-v2 relays under NAT, double-NAT residential, slow
overlays) that transfer can be slow or unreliable — meanwhile each
worker's outbound internet is usually fine.
LOCALAI_PREFETCH_MODELS lets the operator name gallery model IDs to
download at worker boot, BEFORE the worker subscribes to backend.install
events. Reuses gallery.InstallModelFromGallery so the on-disk /models
layout matches what the master would have pushed, and the master can
still push files on demand if the gallery is unreachable at boot
(prefetch is non-fatal on every error path).
The installer is wrapped in a function-value indirection so tests can
swap a fake without touching the real gallery; production never
reassigns the binding.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(crispasr): backend source files (Go gRPC server, C-ABI shim, build files)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* polish(crispasr): brand error strings + fix stale shim comment
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* build(crispasr): register backend in root Makefile
Mirror the whisper Go backend registration for the new crispasr
backend: NOTPARALLEL entry, prepare-test-extra/test-extra hooks,
BACKEND_CRISPASR definition, docker-build target generation, and the
docker-build-backends aggregate target.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(crispasr): add backend build matrix entries
Mirror the 11 whisper golang Dockerfile matrix entries (CPU amd64/arm64,
CUDA 12/13, L4T CUDA 13, Intel SYCL f32/f16, Vulkan amd64/arm64, L4T
arm64, ROCm hipblas) with backend and tag-suffix substituted to crispasr.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(gallery): add crispasr backend gallery entries
Add the crispasr meta anchor and its full set of image gallery entries
(cpu, metal, cuda12/13, rocm, intel-sycl f32/f16, vulkan, L4T arm64,
L4T cuda13 arm64, plus -development variants), mirroring the whisper
backend gallery block.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(crispasr): bump CRISPASR_VERSION via bump_deps workflow
Track CrispStrobe/CrispASR main branch and bump CRISPASR_VERSION in
backend/go/crispasr/Makefile.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* build(crispasr): don't wire fixture-gated test into test-extra
Mirror the whisper Go backend: its AudioTranscription test is gated on
model/audio fixtures and skips in CI, so building crispasr (the heaviest
ggml compile in the tree) inside the unit-test lane adds a long compile
for zero coverage. The backend image build in backend-matrix.yml remains
the authoritative compile check.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(crispasr): add darwin metal build entry (mirror whisper)
The metal-crispasr gallery entries and capabilities.metal mapping
reference -metal-darwin-arm64-crispasr, which is only produced by an
includeDarwin entry. Mirror whisper's darwin metal entry so the tag
actually gets built.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(crispasr): place hipblas matrix entry next to whisper twin
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(crispasr): register crispasr as pref-only ASR backend + test
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(crispasr): port whisper behavioral suite (cancellation + streaming)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(crispasr): fix skip message env var names to CRISPASR_*
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(crispasr): switch shim to crispasr_session_* multi-architecture API
The shim used whisper_full(), which in CrispASR is the whisper-only path:
libcrispasr only transcribes Whisper GGUFs through it. Multi-architecture
transcription (Parakeet, Voxtral, Qwen3-ASR, Canary, Granite, FunASR,
Paraformer, SenseVoice, ...) goes through the crispasr_session_* C-ABI,
which auto-detects the architecture from the GGUF and dispatches to the
matching backend.
Rewrite the C shim around crispasr_session_open / _transcribe_lang /
_result_* and add get_backend() so the selected backend is logged.
load_model now takes a threads param (session_open binds n_threads at
open). The session result is segment+word based with no token IDs and no
per-decode callback, so drop n_tokens / get_token_id /
get_segment_speaker_turn_next / set_new_segment_callback. set_abort is
kept for API parity but is best-effort: the session transcribe is blocking
with no abort hook.
Update the purego bindings and gocrispasr.go to match: tokens are left
empty, speaker-turn handling is removed, and AudioTranscriptionStream
emits one delta per non-empty segment after the blocking decode returns
(no progressive streaming via the session API), preserving the
concat(deltas) == final.Text invariant.
crispasr_session_set_translate is exported by libcrispasr but not declared
in crispasr.h, so it is forward-declared in the shim alongside the
open/transcribe/result functions.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* build(crispasr): link full CrispASR backend set for multi-arch support
The shim's crispasr_session_* dispatch calls into the per-architecture
backend libs (parakeet, voxtral, qwen3_asr, canary, funasr, paraformer,
sensevoice, ...), which CrispASR builds as static archives. Linking only
crispasr + ggml dead-stripped every backend object from the final module
(nm backend-symbol count: 0), leaving a whisper-only .so.
Link the same backend set as crispasr-cli so the static archives are
pulled in. After this the module carries the backend symbols (nm count
407, .so grows from ~2.1MB to ~6.7MB) and the session API can dispatch to
every compiled-in architecture.
Also rewrite ${CMAKE_SOURCE_DIR}/examples/talk-llama to
${PROJECT_SOURCE_DIR}/... in the vendored src/CMakeLists.txt: CrispASR
locates its vendored llama.cpp via ${CMAKE_SOURCE_DIR}, which is wrong when
CrispASR is add_subdirectory'd (CMAKE_SOURCE_DIR points at this backend
dir, not the CrispASR root). PROJECT_SOURCE_DIR is correct both standalone
and as a subproject; the sed is idempotent.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(crispasr): adapt suite to session API (blocking, no decode callback)
Register the new symbol set (drop the removed token/speaker/callback funcs,
add get_backend; load_model now takes 2 args). The session transcribe is
blocking with no abort hook, so a mid-decode cancel can't interrupt it:
change the cancellation spec to cancel the context before the call and
assert codes.Canceled from the pre-call ctx.Err() check, dropping the
<5s mid-decode timing assertion. The streaming spec still holds with
per-segment post-decode emission (>=2 deltas, concat(deltas) == final.Text).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(gallery): add CrispASR ASR model entries (-crispasr)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(gallery): keep only session-auto-detectable CrispASR ASR models
The crispasr backend loads models via crispasr_session_open, which
auto-detects the backend from the GGUF general.architecture using
crispasr_detect_backend_from_gguf. Architectures not in that detect
map cannot be opened, so those gallery entries fail to load.
Removed entries whose architecture is not wired into CrispASR
v0.6.11's session auto-detect router (they can be re-added when
upstream maps them):
- Not in the detect map: data2vec, firered-asr, funasr,
fun-asr-mlt-nano, glm-asr, hubert, kyutai-stt, mega-asr, mimo-asr,
moonshine{,-de,-streaming,-tiny-de}, omniasr{,-llm,-llm-1b},
paraformer, sensevoice.
- Pending verification (filename-heuristic routed, not arch-detected):
parakeet-ctc-0.6b, parakeet-ctc-1.1b. Their GGUFs are routed to the
fastconformer-ctc backend by a filename heuristic in the model
registry, which implies general.architecture is not a mapped string.
Kept the parakeet rnnt/tdt_ctc variants: convert-parakeet-to-gguf.py
writes general.architecture="parakeet" unconditionally and encodes the
rnnt/ctc distinction in metadata fields, so they session-auto-detect.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(crispasr): TTS synthesis via crispasr_session_synthesize (24kHz)
Add tts_synthesize/tts_free/tts_set_voice to the C-ABI shim. They reuse
the already-open g_session (crispasr_session_open auto-detects a TTS
model) and dispatch to the upstream synthesis call, which returns
malloc'd 24 kHz mono float PCM. Orpheus needs a SNAC codec path that we
do not set, so it returns NULL here and surfaces as an error Go-side.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(crispasr): implement TTS/TTSStream gRPC methods
Bind the new shim functions via purego and implement TTS, TTSStream and
a writeWAV24k helper. synthesize copies the C-owned PCM out before
freeing it; TTS writes a 24 kHz mono 16-bit WAV to req.Dst via
go-audio/wav. CrispASR has no progressive synth, so TTSStream
synthesizes fully, encodes to WAV, and emits the bytes as a single
chunk; it owns the results-channel close (the gRPC server wrapper ranges
until close), mirroring vibevoice-cpp's TTSStream.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(crispasr): log when a TTS voice override is not honored
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(gallery): add CrispASR vibevoice-tts model entry
Only vibevoice-tts works through the current shim: qwen3-tts, chatterbox,
and orpheus require companion codec/s3gen/SNAC paths (set_codec_path /
set_s3gen_path) that the shim doesn't wire yet, and kokoro/indextts/voxcpm2
aren't in the session auto-detect map. Those are follow-ups.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(crispasr): gated TTS synthesis spec
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(crispasr): satisfy golangci-lint (errcheck defers + unsafeptr nolint)
The crispasr Go file is entirely new, so new-from-merge-base lints every
line (unlike the grandfathered whisper backend it was forked from):
- handle os.RemoveAll / fh.Close return values in AudioTranscription
- annotate the two intentional C-pointer unsafe.Slice sites with //nolint:govet
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(crispasr): backend: and codec: model options (explicit arch + companion files)
Add two model-config options to the CrispASR backend via opts.Options:
- backend:<name> selects an explicit CrispASR backend (bypassing
auto-detect) by routing load_model through
crispasr_session_open_explicit, unlocking architectures the
detector won't pick on its own (qwen3, cohere, granite, voxtral,
moonshine, mimo-asr, orpheus, kokoro, chatterbox, etc.).
- codec:<path> loads a companion file (qwen3-tts codec, orpheus SNAC,
chatterbox s3gen, or mimo-asr tokenizer) via the universal
crispasr_session_set_codec_path setter after the session opens. A
relative path resolves against the model directory. rc==0 means
success or not-applicable; only a negative rc is fatal.
The C shim load_model gains a backend_name argument and a new
set_codec_path entry point; the Go bridge parses the prefix:value
options and registers the new symbol. The vad_only path is unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(gallery): expand CrispASR models via backend:/codec: options (explicit arch + companions)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(gallery): use virtual.yaml base for crispasr models
The crispasr entries are just backend + model + a couple options, fully
expressed inline via overrides:/files: in gallery/index.yaml. Point each
url: at the shared gallery/virtual.yaml (the established 'virtual' model
trick) and drop the 36 redundant per-model gallery/*-crispasr.yaml files.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(gallery): drop voice-requiring TTS entries (keep vibevoice-tts)
Real e2e showed qwen3-tts/orpheus/chatterbox don't synthesize through the
current shim: the codec: companion loads fine, but these engines additionally
need a voice pack / voice prompt / reference clip (qwen3-tts base errors
'no voice'; chatterbox is zero-shot cloning; orpheus uses named voices) that
the backend doesn't wire. (qwen3-tts also can't auto-detect: its GGUF arch is
'qwen3tts', unmapped by the detector — would need backend:qwen3-tts.) Removed
to avoid shipping non-working gallery entries; vibevoice-tts (built-in voice,
e2e-verified) remains the working TTS. Voice-pack wiring is a follow-up.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(crispasr): speaker: and voice: TTS options (baked speakers + voice packs/prompts)
speaker:<name> -> crispasr_session_set_speaker_name (baked speakers: qwen3-tts
CustomVoice, orpheus). voice:<path>(+voice_text:<ref>) -> crispasr_session_set_voice
(voice-pack GGUF, or WAV zero-shot clone with ref text). Applied at Load as the
default voice; req.Voice still overrides the speaker per request.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(gallery): re-add e2e-verified TTS engines (chatterbox, qwen3-tts-customvoice, orpheus)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
parakeet-cpp was added in #10084 but not registered in
BackendCapabilities, so GuessUsecases only allowed "whisper" for
FLAG_TRANSCRIPT and the UI could not classify parakeet-cpp models as
speech-to-text. The result was that parakeet models appeared only in
the LLM selector in the speech-to-speech pipeline, making them
unusable for transcription through the UI.
Closes #9718

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Cross-referencing backend/ directories against BackendCapabilities found
five backends that exist and work but have no entry in the map, so
GuessUsecases falls back to heuristics that mis-classify them (e.g.
a TTS backend appears as an LLM in the UI).
Added entries, each modelled on the corresponding Python twin or the
nearest equivalent already in the map:
sglang — LLM (Predict/PredictStream/TokenizeString, vision)
vibevoice-cpp — ASR + TTS/TTSStream (mirrors vibevoice Python)
sherpa-onnx — ASR + TTS/TTSStream + VAD (multi-model toolkit)
qwen3-tts-cpp — TTS (mirrors qwen-tts Python)
rfdetr-cpp — object detection (mirrors rfdetr Python)
Found by diffing `ls backend/{go,python}/` against the keys in
BackendCapabilities. Remaining gaps (insightface, speaker-recognition,
sam3-cpp) use custom gRPC methods not yet in the Method* constants —
left for a follow-up.
Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(ds4): add standalone ds4-worker distributed worker binary
Add worker_main.c, a minimal standalone worker that owns a slice of the
model's transformer layers and serves activations over ds4's own TCP
transport via ds4_dist_run(). It links the same engine objects the
backend already builds (including ds4_distributed.o) and has NO
gRPC/protobuf dependency, so it builds even on hosts lacking protobuf/grpc
dev headers. Launched by `local-ai worker ds4-distributed`.
Wire the ds4-worker CMake target (mirrors grpc-server's object/GPU/native
handling) and have the Makefile copy + clean the binary alongside
grpc-server. Ignore the built ds4-worker artifact.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(ds4): package ds4-worker alongside grpc-server
Copy the standalone ds4-worker binary into the backend package (Linux
package.sh) and the Darwin OCI tar (ds4-darwin.sh: both the explicit copy
and the otool dylib-bundling loop) so distributed workers ship with the
backend.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(ds4): tighten ds4-worker integer arg validation to match upstream
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(ds4): wire grpc-server as distributed coordinator
Add distributed COORDINATOR support to the ds4 backend's gRPC server.
Distributed inference is an engine backend: when LoadModel receives
'ds4_role:coordinator', the process populates ds4_engine_options.distributed
(role, layer slice, listen host/port) before ds4_engine_open, then the normal
ds4_session_* generation path runs transparently once the worker route covers
all layers.
- New LoadModel options: ds4_role, ds4_layers (START:END or START:output),
ds4_listen (host:port), ds4_route_timeout.
- parse_layers_spec() maps the layer spec onto ds4_distributed_layers.
- wait_route_ready() blocks generation until
ds4_session_distributed_route_ready() reports full coverage (or timeout),
gating both Predict and PredictStream; returns UNAVAILABLE on timeout/error.
- No ds4_role => g_distributed stays false and wait_route_ready is a no-op,
so single-node behavior is unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(ds4): don't block Status during route wait; validate coordinator opts
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(cli): add ds4-distributed worker exec helper
Add the ds4WorkerArgs helper plus findDS4Backend/DS4Distributed.Run that
resolve the ds4 backend via the gallery and exec the packaged ds4-worker
binary. Unlike worker_llamacpp.go, ds4 bundles its own dynamic loader
(lib/ld.so) for glibc compatibility, so when present we exec ds4-worker
through that loader with LD_LIBRARY_PATH=<backend>/lib, mirroring
backend/cpp/ds4/run.sh; otherwise we exec it directly.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(cli): register the ds4-distributed worker subcommand
Wire DS4Distributed into the Worker kong command tree so
`local-ai worker ds4-distributed` is available.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* docs(ds4): document layer-split distributed inference
Add a ds4 section to the distributed-mode feature docs (coordinator
model YAML, manual worker command, layer-range semantics, the
'GGUF on every machine' requirement, coordinator-listens dial
direction vs llama.cpp) and a terse Distributed mode section to the
ds4 backend agent guide.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* test(ds4): opt-in hardware-gated distributed e2e spec
Add a self-contained, opt-in Ginkgo spec to the backend e2e suite that
spins a ds4 coordinator (via the packaged run.sh, loaded with
ds4_role/ds4_layers/ds4_listen options) plus a ds4-worker process for
the upper layers, then uses Eventually to assert a short successful
Predict once the layer route forms, before tearing the worker down.
Gated by BACKEND_TEST_DS4_DISTRIBUTED=1 (plus the existing
BACKEND_BINARY + BACKEND_TEST_MODEL_FILE and optional layer/listen/accel
knobs); compiles and skips cleanly with no env, hardware, or model.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* test(ds4): pass coordinator ctx to worker; lowercase error string
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* docs(ds4): note distributed transport is plaintext/unauthenticated
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* style(ds4): replace em dashes in distributed docs/agent/test per repo convention
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(ds4): link ds4-worker with the C++ driver for CUDA/Metal builds
The ds4-worker target is built from worker_main.c (C), so CMake linked it
with the C driver. The nvcc-built ds4_cuda.o (and Obj-C++ ds4_metal.o)
reference the C++ runtime, so the CUDA/Metal builds failed with undefined
libstdc++ symbols (std::__throw_length_error). The CPU build passed because
ds4_cpu.o is pure C. Force LINKER_LANGUAGE CXX so libstdc++ is linked.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(radixtree): generic prefix tree skeleton with longest-match
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(radixtree): Insert with path recency refresh and entry cap
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(radixtree): TTL idle-expiry and Evict sweep with branch pruning
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(radixtree): recency-weighted per-value Weight
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(radixtree): Remove all entries for a value
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(radixtree): race-free concurrency smoke test
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(radixtree): reclaim empty branches, RWMutex reads, TTL boundary, empty-key guard
Address review findings on the generic prefix tree:
- Extract a shared pruneWalk helper parameterized by a shouldClear
predicate and use it from Evict, Remove, and the MaxEntries path.
Previously evictOldestLocked cleared a victim's value but never
removed the now value-less node or its childless ancestors, so
internal nodes accumulated under sustained churn at the cap. The
MaxEntries path now prunes the victim and its empty ancestors.
- DRY: pruneWalk replaces the duplicated logic in the former
pruneLocked and Remove's inner closure.
- Switch Tree.mu to sync.RWMutex; LongestMatch, Weight and Len take
the read lock (RLock) while Insert, Evict and Remove keep the write
lock. Confirmed race-clean under go test -race.
- Document the strict greater-than TTL boundary on Options.TTL and
expired: age exactly equal to TTL is still live.
- Guard Insert against an empty key (no-op): the root never holds a
value.
Adds Ginkgo specs covering MaxEntries eviction, ancestor reclamation,
the no-growth-past-cap invariant, the TTL boundary, and empty-key
behavior for both Insert and LongestMatch.
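The idea behind a shared pruneWalk helper can be illustrated on a simplified, uncompressed byte trie. This is a sketch under stated assumptions, not the radixtree package: the `node` type, the string values and the predicate signature are all hypothetical, and locking is omitted.

```go
package main

import "fmt"

type node struct {
	children map[byte]*node
	value    string
	hasValue bool
}

func newNode() *node { return &node{children: map[byte]*node{}} }

// insert stores value at key; an empty key is a no-op so the root never
// holds a value.
func (n *node) insert(key, value string) {
	if key == "" {
		return
	}
	cur := n
	for i := 0; i < len(key); i++ {
		next, ok := cur.children[key[i]]
		if !ok {
			next = newNode()
			cur.children[key[i]] = next
		}
		cur = next
	}
	cur.value, cur.hasValue = value, true
}

// pruneWalk clears every value matching shouldClear and removes nodes that
// end up with neither a value nor children. It reports whether n itself is
// now empty, so the caller can unlink it.
func pruneWalk(n *node, shouldClear func(value string) bool) bool {
	for k, child := range n.children {
		if pruneWalk(child, shouldClear) {
			delete(n.children, k) // deleting during range is safe in Go
		}
	}
	if n.hasValue && shouldClear(n.value) {
		n.value, n.hasValue = "", false
	}
	return !n.hasValue && len(n.children) == 0
}

func countNodes(n *node) int {
	total := 1
	for _, child := range n.children {
		total += countNodes(child)
	}
	return total
}

func main() {
	root := newNode()
	root.insert("ab", "node-1")
	root.insert("abc", "node-2")
	root.insert("ad", "node-1")
	pruneWalk(root, func(v string) bool { return v == "node-1" })
	fmt.Println(countNodes(root))
}
```

Because the predicate is a parameter, the same walk can serve expiry (clear if idle too long), removal by value, and cap-driven eviction, which is the de-duplication the commit describes.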
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(prefixcache): RoutePolicy enum with parse/resolve
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(prefixcache): Config with defaults and validation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(prefixcache): deterministic xxhash prefix-chain extractor
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(prefixcache): pure filter-then-score replica selection
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(prefixcache): Provider interface and radix-tree-backed Index
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* style(prefixcache): gofmt policy enum comment alignment
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(prefixcache): head-first prefix chunking and hoist Weight out of sort
Address code-quality review findings in the prefixcache package.
Correctness: ExtractChain now chunks from absolute offset 0 with fixed
[0,W),[W,2W),... boundaries and caps the chain to the FIRST MaxDepth
head blocks. The previous tail-keeping logic shifted the byte offset by a
non-window amount once a conversation grew past MaxDepth*WindowBytes,
changing every hash each turn and silently breaking cross-turn
longest-prefix matching. The reusable KV/prefix cache lives at the head
of the prompt, so anchoring at offset 0 makes the chain a true
prefix-chain: P and P+suffix share their full leading overlap. Add a
regression spec proving cross-turn stability past the cap.
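Head-first chunking can be sketched as follows. This is an illustration, not ExtractChain itself: the function name and parameters are hypothetical, FNV-1a from the standard library stands in for xxhash, and hashing only complete windows is an assumption of the sketch.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// extractChain hashes fixed windows [0,W), [W,2W), ... from offset 0 and
// keeps at most maxDepth head blocks. Each hash folds in the previous one,
// so equal hashes at depth i imply equal leading bytes up to that block.
func extractChain(salt string, prompt []byte, window, maxDepth int) []uint64 {
	if window <= 0 || maxDepth <= 0 {
		return nil
	}
	var chain []uint64
	var prev uint64
	for off := 0; off+window <= len(prompt) && len(chain) < maxDepth; off += window {
		h := fnv.New64a() // stand-in for xxhash
		var buf [8]byte
		binary.LittleEndian.PutUint64(buf[:], prev)
		h.Write([]byte(salt))
		h.Write(buf[:])
		h.Write(prompt[off : off+window])
		prev = h.Sum64()
		chain = append(chain, prev)
	}
	return chain
}

func main() {
	turn1 := []byte("aaaabbbbcccc")
	turn2 := append(append([]byte{}, turn1...), []byte("ddddeeee")...)
	a := extractChain("model", turn1, 4, 8)
	b := extractChain("model", turn2, 4, 8)
	fmt.Println(len(a), len(b), a[2] == b[2])
}
```

Since block boundaries never move as the prompt grows, a longer turn only appends hashes; once the cap is reached the chain simply stops changing.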
Performance: Index.Decide precomputes each candidate's Weight once
(decorate-sort-undecorate) instead of calling the O(tree size) Weight
inside the O(n log n) sort comparator. Behavior is unchanged.
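For reference, decorate-sort-undecorate in Go looks roughly like this. `rankByWeight` and its `weight` callback are hypothetical stand-ins; the real Index.Decide and Weight signatures are not shown in this message.

```go
package main

import (
	"fmt"
	"sort"
)

type scored struct {
	node   string
	weight float64
}

// rankByWeight calls the (possibly expensive) weight function exactly once
// per node, sorts on the cached result, then strips the decoration.
func rankByWeight(nodes []string, weight func(string) float64) []string {
	decorated := make([]scored, len(nodes))
	for i, n := range nodes {
		decorated[i] = scored{node: n, weight: weight(n)}
	}
	sort.SliceStable(decorated, func(i, j int) bool {
		return decorated[i].weight > decorated[j].weight // heaviest first
	})
	out := make([]string, len(decorated))
	for i, d := range decorated {
		out[i] = d.node
	}
	return out
}

func main() {
	weights := map[string]float64{"node-a": 1, "node-b": 3, "node-c": 2}
	fmt.Println(rankByWeight([]string{"node-a", "node-b", "node-c"},
		func(n string) float64 { return weights[n] }))
}
```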
Lint: encode prev with binary.LittleEndian.PutUint64 instead of a manual
byte loop, clearing the modernize rangeint finding.
Also add a concurrent Decide/Observe/Invalidate spec to exercise Index's
documented concurrency safety under go test -race.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(messaging): prefixcache observe/invalidate subjects and payloads
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(prefixcache): NATS sync publish/apply for observe and invalidate
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributedhdr): ctx carrier for prefix-hash chain
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributedhdr): PrefixChainHook indirection for backend-side chain build
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(backend): stash prompt prefix chain on ctx before distributed routing
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(backend): mirror modelID fallback for prefix-chain salt parity
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(nodes): scheduling config columns for prefix-cache routing
Add RoutePolicy and per-model balance/prefix-match override columns to
ModelSchedulingConfig and include them in the SetModelScheduling upsert
DoUpdates list so updates are not dropped on conflict.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(nodes): optional route preference in FindAndLockNodeWithModel
Add a RoutePreference type and a new pref parameter so the atomic
pick+lock+increment can be biased toward a preferred node without
weakening atomicity. A nil preference reproduces the previous ORDER BY
behavior exactly. Update the ModelRouter interface, both router.go call
sites (pass nil for now; Phase 5 builds the real preference), the test
doubles, and the distributed e2e caller.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(prefixcache): make Sync satisfy Provider with Evict
Sync.Observe now returns whether the local index treated the assignment as
new or extended, and Sync gains an Evict method that delegates to the wrapped
index. Together these let SmartRouter hold a single prefixcache.Provider that
broadcasts via NATS. Adds a compile-time Provider assertion and an
Evict-delegates behavioral test.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(nodes): prefix-cache-aware preference and observe in SmartRouter.Route
Add a PrefixProvider + PrefixConfig to SmartRouterOptions/SmartRouter (nil
keeps routing byte-for-byte the round-robin floor). On each request Route now
calls buildPreference: it reads the prompt prefix chain from ctx
(distributedhdr.PrefixChain), resolves the per-model policy/thresholds over
the global config, loads candidate replica in-flight via a new registry read
LoadedReplicaStats (deduped to one entry per node using the MIN in-flight
across that node's replicas), asks the provider to Decide, and runs
prefixcache.Select. The chosen node is passed as the RoutePreference to
FindAndLockNodeWithModel on all three pick paths (cache hit, locked re-pick,
cold scheduleAndLoad), and the served node is recorded via Observe only when
the resolved policy is prefix_cache so round-robin models never pollute the
tree.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(nodes): invalidate prefix-cache entries on unload and stale removal
UnloadModel and both staleness fall-through paths in Route (after a failed
gRPC probe and RemoveNodeModel) now call prefixProvider.Invalidate(model,
nodeID), guarded by a nil-provider check so the round-robin floor is
unchanged. At runtime the provider is the *prefixcache.Sync, so invalidations
also broadcast to peer frontends. Adds a test that a previously hot prefix no
longer Decides to a node after UnloadModel.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(prefixcache): rolling forced-disturb pressure counter
Add a concurrency-safe per-model rolling counter that tracks how many
times a request had a usable hot prefix match but the load guard forced
it off the warm node. Entries outside the window are dropped lazily on
Count so the backing slice stays bounded.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(nodes): autoscale on prefix-cache forced-disturb pressure
Wire the rolling forced-disturb counter into the SmartRouter and the
ReplicaReconciler.
Router: in buildPreference, after Decide + Select, record a forced-disturb
when a usable hot prefix match existed (d.HotNodeID != "" and
d.MatchRatio >= cfg.MinPrefixMatch) but Select chose a different node (or
nothing) because the load guard ruled the warm node out. This is the
scale-worthy signal: the cache-warm replica is saturated. It deliberately
does not fire for all-unique workloads (no hot match), avoiding
false-positive scale-ups. Pressure is optional on SmartRouterOptions; nil
keeps the path a no-op.
Reconciler: read the same Pressure instance in reconcileModel as an extra
scale-up reason, reusing the existing MaxReplicas + ClusterCapacityForModel
guards and the UnsatisfiableUntil cooldown that gates the whole method.
Pressure never overrides MaxReplicas and never force-evicts; a no-capacity
model does not spin. Window and threshold come from prefixcache.Config
(PressureWindow default 1m, PressureScaleThreshold default 1) and are
configurable via ReplicaReconcilerOptions.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(prefixcache): bound Pressure slice in Record; drop dead reconciler pressureWindow
Record now prunes entries older than the rolling window (the same prune
Count does), via a shared pruneLocked helper, so a model that takes
forced-disturb records but is never Counted (e.g. one with zero loaded
replicas the reconciler skips) no longer grows its backing slice
unbounded.
Also removes the dead pressureWindow struct field and the
ReplicaReconcilerOptions.PressureWindow option from the reconciler: they
were stored but never read (the window lives inside the *prefixcache.Pressure
instance). The scale block now reads pressure.Count once into a local.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(api): prefix-cache fields in scheduling endpoint DTO with validation
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): prefix-cache routing controls in node scheduling form
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): wire prefix-cache index, NATS sync, and config
Activates prefix-cache-aware routing in distributed mode. Builds the
prefixcache Index + NATS-backed Sync + Pressure counter, installs the
distributedhdr.PrefixChainHook so core/backend/llm.go attaches a prefix
chain per request, subscribes to prefixcache.observe/prefixcache.invalidate
to apply peers' events to the local index (no re-broadcast), threads
PrefixProvider/PrefixConfig/Pressure into the SmartRouter and
Pressure/PressureThreshold into the ReplicaReconciler, and runs a
background eviction ticker (every TTL/2) bound to the app context.
Enabled by default; --distributed-prefix-cache=false (LOCALAI_DISTRIBUTED_PREFIX_CACHE)
opts out and leaves the provider/pressure nil so routing stays round-robin.
--distributed-prefix-cache-ttl (LOCALAI_DISTRIBUTED_PREFIX_CACHE_TTL, default 5m)
controls entry idle-timeout and eviction cadence.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(nodes): round-robin-floor invariant for prefix-cache routing
Drives Select directly: a saturated hot node (in_flight 50 vs 0) is never
picked even with a perfect prefix match (round-robin floor holds), while a
balanced hot node within the load slack is reused.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(prefixcache): clear branch lint findings and em dashes
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): validate prefix-cache config at startup wiring
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* perf(radixtree): single-walk WeightsFor for batch value weights
Add Tree.WeightsFor(values, now) which computes the recency-weighted
weight for many values in a single O(N + len(values)) tree traversal,
versus calling Weight once per value (O(len(values) * N)). Consumers
that score K candidates against the tree under the read lock no longer
pay K full walks.
Extract the per-entry contribution math into an unexported helper shared
by both Weight and WeightsFor so the metric stays identical (DRY).
Weight's public behavior is unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(config): add ModelConfig.ModelID() single source of truth
The c.Name fallback to c.Model was duplicated in core/backend/options.go
(feeding model.WithModelID) and hand-copied into core/backend/llm.go (the
prefix-chain salt). These MUST agree or the prefix-cache salt diverges
silently from the id the model loader tracks. Consolidate both into a new
config.ModelConfig.ModelID() helper and call it from both sites. Behavior
is identical.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* perf(prefixcache): reuse one xxhash.Digest in ExtractChain
ExtractChain allocated a fresh xxhash.New() Digest per block (up to MaxDepth
per call) and grew the chain slice without preallocation. Reuse a single
Digest via Reset() before each block and preallocate the chain to
min(nBlocks, MaxDepth).
xxhash seed 0 is stateless, so Reset()+Write produces the byte-identical
value to a fresh New()+Write. Output hashes are unchanged, preserving the
cross-process determinism that peers rely on over NATS. Verified by capturing
ExtractChain output for the existing test inputs before and after the
refactor: identical. Existing extractor tests pass unchanged.
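The before/after equivalence check can be sketched as follows. This uses the standard library's hash/fnv as a stand-in for xxhash (an assumption made to keep the example dependency-free), and the fixed-size, per-block hashing layout is hypothetical rather than ExtractChain's real one. The point shown is only that Reset()+Write on a reused hasher matches a fresh hasher per block.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// chainReused hashes each full block with one hasher, Reset before every block,
// and preallocates the chain to min(nBlocks, maxDepth).
func chainReused(data []byte, blockSize, maxDepth int) []uint64 {
	n := len(data) / blockSize
	if n > maxDepth {
		n = maxDepth
	}
	chain := make([]uint64, 0, n)
	h := fnv.New64a()
	for i := 0; i < n; i++ {
		h.Reset()
		h.Write(data[i*blockSize : (i+1)*blockSize]) // hash.Hash writes never fail
		chain = append(chain, h.Sum64())
	}
	return chain
}

// chainFresh is the pre-refactor shape: a new hasher per block, no preallocation.
func chainFresh(data []byte, blockSize, maxDepth int) []uint64 {
	var chain []uint64
	for i := 0; (i+1)*blockSize <= len(data) && i < maxDepth; i++ {
		h := fnv.New64a()
		h.Write(data[i*blockSize : (i+1)*blockSize])
		chain = append(chain, h.Sum64())
	}
	return chain
}

func equal(a, b []uint64) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	data := []byte("a shared system prompt followed by a volatile tail")
	fmt.Println(equal(chainReused(data, 8, 4), chainFresh(data, 8, 4)))
}
```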
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(prefixcache): drop hot match when matched node is not a candidate; weigh cold candidates in one walk
Index.Decide called radixtree.LongestMatch over the whole tree, so the
deepest match could be a node that is offline, unloaded, or simply not in
the passed candidate set. Honoring that as HotNodeID produced a false
forced-disturb signal upstream (buildPreference records pressure when
chosen != HotNodeID), making it look like a warm replica was load
saturated when it was actually absent.
Build the candidate set once and only set HotNodeID/MatchRatio when the
matched node is an actual candidate; otherwise fall back to cold
placement. A future refinement could ask the tree for the longest match
restricted to the candidate nodes (shallower-but-valid) instead of
dropping it.
Also replace the per-candidate tree.Weight call in the cold-order sort
with a single tree.WeightsFor walk, turning O(K*N) under the read lock
into O(N + K).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(prefixcache): remove Select's unreachable deterministic fallback
buildPreference always passes ColdOrder as a permutation of the full
candidate set, so the cold-order loop hits every eligible candidate. The
trailing best/bestIF scan was dead. Replace it with a plain `return ""`
and document that ColdOrder is guaranteed to cover all candidates, so ""
means none were eligible.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(nodes): fetch model scheduling config once per Route
GetModelScheduling was read three times per request (in
resolveSelectorCandidates, buildPreference, and nodeMatchesScheduling):
three DB round-trips for one row that should not change during a
request, with no guarantee the three reads saw the same snapshot. Fetch it once near the top of
Route and thread the *ModelSchedulingConfig (may be nil) into all three
helpers. scheduleNewModel keeps its own fetch since it runs outside the
Route snapshot. Behavior is identical for nil sched.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(autoscale): add Pressure.Reset to consume forced-disturb signal
Pressure.Count is non-draining (it prunes only by age), so a single burst
of forced-disturbs stays within the rolling window for the whole window and
keeps Count >= threshold on every reconciler tick. The reconciler will use
Reset to clear a model's events after acting on the signal so a fresh
scale-up requires fresh forced-disturbs to accumulate, rather than one burst
driving the model toward MaxReplicas.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(autoscale): at most one scale-up per reconcile tick, consume pressure
Two autoscale bugs:
1. Over-scaling: the pressure scale-up block read Pressure.Count but never
consumed it. With a non-draining counter a single forced-disturb burst
kept Count >= threshold across the whole window, firing scaleUp on every
tick and pushing the model toward MaxReplicas off one transient burst.
After a successful pressure-triggered scale-up the reconciler now calls
Pressure.Reset to consume the signal.
2. Double scale-up in one tick: the all-replicas-busy block and the pressure
block could both fire in the same reconcileModel pass, each calling
scaleUp(+1) against the same `current` read once at the top, so a model
that was both busy and over threshold scaled +2 and could overshoot
MaxReplicas by one. A scaledUp flag now enforces at most one scaleUp(+1)
per tick: the pressure block is skipped if the busy block already scaled,
and scale-down is skipped in any tick that scaled up.
MinReplicas enforcement, UnsatisfiableUntil backoff, and capacity guards are
unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(nodes): replica-removed chokepoint hook for prefix-cache invalidation
Add SetReplicaRemovedHook to NodeRegistry and fire it from both
RemoveNodeModel and RemoveAllNodeModelReplicas after a successful
delete. This is the single chokepoint every replica-removal path funnels
through (router eviction, reconciler scale-down, probe reaper,
health-monitor node-down reap, RemoteUnloaderAdapter), so the
prefix-cache index can be invalidated by construction rather than wiring
each call site individually.
The hook is stored in an atomic.Pointer so the startup wiring (setter)
and the request/reconcile-time fire are race-free; it is nil-safe when
unset. GORM Delete reports no error for a no-op delete, so the hook also
fires when nothing was removed; the consumer's Invalidate(model, node)
is idempotent so this is harmless.
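A rough sketch of the race-free, nil-safe hook storage described above. The Registry type, the hook signature, and fireReplicaRemoved's shape are illustrative assumptions, not the NodeRegistry code; sync/atomic's generic atomic.Pointer requires Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// ReplicaRemovedHook is a hypothetical hook signature.
type ReplicaRemovedHook func(model, nodeID string)

// Registry stands in for the real registry; only the hook plumbing is shown.
type Registry struct {
	hook atomic.Pointer[ReplicaRemovedHook]
}

// SetReplicaRemovedHook may be called at startup while requests are in flight.
func (r *Registry) SetReplicaRemovedHook(h ReplicaRemovedHook) {
	if h == nil {
		r.hook.Store(nil)
		return
	}
	r.hook.Store(&h)
}

// fireReplicaRemoved is a no-op when no hook is installed.
func (r *Registry) fireReplicaRemoved(model, nodeID string) {
	if h := r.hook.Load(); h != nil {
		(*h)(model, nodeID)
	}
}

func main() {
	var r Registry
	r.fireReplicaRemoved("m", "node-a") // safe: no hook set yet

	r.SetReplicaRemovedHook(func(model, nodeID string) {
		fmt.Println("invalidate", model, nodeID)
	})
	r.fireReplicaRemoved("m", "node-a")
}
```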
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): invalidate prefix-cache on any replica removal via registry hook
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(prefixcache): single source of truth for threshold bounds
Extract ValidateThresholds into prefixcache/config.go so the per-model
override validation (nodes.go endpoint) and Config.Validate share one
implementation of the numeric bounds (min_prefix_match in [0,1],
balance_abs_threshold >= 0, balance_rel_threshold == 0 or >= 1) instead
of hard-coding them in two places. The route_policy allow-list stays
explicit (not ParsePolicy, which maps typos to Default).
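For illustration, the shared bounds check might look like the sketch below. The parameter types and error wording are assumptions; only the three numeric bounds are taken from this message.

```go
package main

import (
	"errors"
	"fmt"
)

// ValidateThresholds is a hypothetical version of the shared bounds check.
// Conditions are written positively so NaN inputs are rejected as well.
func ValidateThresholds(minPrefixMatch float64, balanceAbs int, balanceRel float64) error {
	if !(minPrefixMatch >= 0 && minPrefixMatch <= 1) {
		return errors.New("min_prefix_match must be in [0,1]")
	}
	if balanceAbs < 0 {
		return errors.New("balance_abs_threshold must be >= 0")
	}
	if !(balanceRel == 0 || balanceRel >= 1) {
		return errors.New("balance_rel_threshold must be 0 or >= 1")
	}
	return nil
}

func main() {
	fmt.Println(ValidateThresholds(0.5, 2, 1.5))
	fmt.Println(ValidateThresholds(0.5, 2, 0.5))
}
```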
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(nodes): preserve prefix-cache settings on partial scheduling update
A scheduling POST that omitted route_policy/thresholds (e.g. a
min_replicas-only update) full-replaced every column and silently reset
the model's previously-configured prefix-cache settings to empty/zero.
Make the four prefix-cache request fields pointers so omitted is
distinguishable from explicit zero, and merge PATCH-style in
SetSchedulingEndpoint: a provided pointer wins, an omitted one preserves
the existing config value (zero default when none). Non-prefix fields
keep their full-replace PUT semantics. Validation now runs on the
resolved values via prefixcache.ValidateThresholds.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(prefixcache): make Invalidate a no-op for uncached models and skip empty broadcasts
A registry chokepoint fires Sync.Invalidate(model, nodeID) for every replica
removal of every model, including round-robin models that never used the
prefix cache. Index.Invalidate previously called tree(model), which lazily
created and permanently retained an empty radix tree for any model that ever
lost a replica, growing the trees map without bound. Sync.Invalidate also
published a NATS PrefixCacheInvalidateEvent on every call, amplifying no-op
removals across the cluster.
Index.Invalidate now looks the tree up read-only via existingTree and returns
without allocating when none exists. The Provider interface is unchanged;
Sync gates the broadcast through an optional invalidateExisting(bool) capability
type-asserted from the wrapped Index, falling back to the prior always-broadcast
behavior for other Provider implementations.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* perf(prefixcache): derive Decide candidacy from WeightsFor and skip trivial sort
WeightsFor already returns a map keyed by every requested candidate, so the
separate candidates set built to validate the hot match was redundant: a node
is a candidate iff it is a key in the weights map. Drop the extra map and gate
the hot-match check on weights membership. Also skip the sort when there is at
most one candidate, since the input order is already the cold order. Behavior
is unchanged.
Deferred follow-up: skipping the WeightsFor walk entirely when a hot match wins
would require computing the weights lazily, which needs cross-file changes and
is out of scope here.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(nodes): fire replica-removed hook on bulk node_models deletes; trim LoadedReplicaStats columns
Bulk node-scoped node_models deletes (Register re-register cleanup,
MarkOffline, MarkDraining, Deregister) removed rows directly without
firing the replica-removed hook, so the prefix-cache index kept
pointing at nodes whose models were gone. Capture the DISTINCT model
names before each bulk delete and fire fireReplicaRemoved once per
model after a successful delete, restoring the single-chokepoint
invariant for all removal paths. The pre-query is skipped when no hook
is set so the no-hook path stays cheap.
Also narrow LoadedReplicaStats to SELECT only node_id and in_flight
(the only fields the router consumer reads), dropping the JOIN-side
available_vram fetch and unused columns while keeping the
[]ReplicaCandidate return type unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(reconciler): consume autoscale signals only on a real scale-up
scaleUp was fire-and-forget (void) yet its callers unconditionally
consumed the pressure signal (Pressure.Reset) and the MinReplicas
hysteresis (ClearUnsatisfiable) right after calling it. If scaleUp
added nothing (ScheduleAndLoadModel errored, or no node could be
loaded) the saturated warm replica got no new replica AND its
accumulated forced-disturb history was wiped, forcing the signal to
re-accumulate over a full PressureWindow before the next attempt.
Make scaleUp return whether at least one replica was actually
scheduled, and gate the side effects on it:
- pressure block (2b): set scaledUp and call Pressure.Reset only on
success; on failure preserve the signal so the next tick retries off
the same accumulated pressure.
- busy-burst block (2): set scaledUp from the return value so a failed
attempt does not suppress the pressure path or scale-down.
- MinReplicas block: call ClearUnsatisfiable only on success so a
failed attempt does not reset the unsatisfiable counter.
All existing invariants (MaxReplicas, capacity gating,
UnsatisfiableUntil cooldown, at-most-one-scale-up-per-tick) are
preserved.
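The gating above, reduced to a sketch. The function shape and the callback parameters are hypothetical, and the MaxReplicas, capacity, MinReplicas, and cooldown guards are deliberately left out; only the "consume on success" and "at most one scale-up per tick" rules are modelled.

```go
package main

import "fmt"

// tickInput is a hypothetical snapshot of the signals read at the top of a tick.
type tickInput struct {
	allBusy   bool
	pressure  int
	threshold int
}

// reconcileTick returns whether the tick scaled up. scaleUp must report
// whether at least one replica was actually scheduled.
func reconcileTick(in tickInput, scaleUp func() bool, resetPressure func()) bool {
	scaledUp := false

	// Busy-burst block: a failed attempt leaves scaledUp false, so it does
	// not suppress the pressure path below.
	if in.allBusy {
		scaledUp = scaleUp()
	}

	// Pressure block: skipped if the busy block already scaled; the signal
	// is consumed only when a replica was really added.
	if !scaledUp && in.pressure >= in.threshold {
		if scaleUp() {
			scaledUp = true
			resetPressure()
		}
	}
	return scaledUp // callers skip scale-down when this is true
}

func main() {
	resets := 0
	failed := reconcileTick(
		tickInput{pressure: 3, threshold: 1},
		func() bool { return false }, // scheduling failed
		func() { resets++ },
	)
	fmt.Println(failed, resets) // the pressure signal is preserved for the next tick
}
```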
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(nodes): drop router's redundant prefix-cache Invalidate calls
The NodeRegistry removal chokepoint (RemoveNodeModel /
RemoveAllNodeModelReplicas) now fires SetReplicaRemovedHook, which
invalidates the prefix-cache index. The router was also calling
prefixProvider.Invalidate explicitly right after each registry removal
on the two stale-replica health-probe fall-throughs in Route and in
UnloadModel, so every router-side eviction invalidated twice (double
tree-prune + double NATS broadcast).
Remove the three redundant explicit Invalidate calls and their empty
nil-guards. Each removed call sat immediately after a registry removal
that fires the hook, so invalidation is preserved via the chokepoint.
Decide/Observe usage is untouched.
Re-point the unit test (fake registry fires no hook) to assert the
removal chokepoint is exercised on unload instead of the router's
direct invalidation.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(prefixcache): broadcast invalidations unconditionally for cross-frontend coherence
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(prefixcache): reject TTL<=0 in Config.Validate (eviction ticker would panic)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(nodes): make capture+delete atomic in bulk node_models removal paths
MarkOffline, MarkDraining, and the Register re-register cleanup ran the
nodeModelNames SELECT and the bulk node_models DELETE as two separate
statements on r.db with no transaction. A SetNodeModel landing between
the two was deleted but its replica-removed hook never fired, leaving
the prefix-cache index pointing at a removed replica until TTL or
candidacy self-heal.
Wrap the capture and the delete in a single db.Transaction in each path
(mirroring how Deregister already does it). The captured model names are
collected into a slice declared outside the closure; the
replica-removed hook fires for each only after the transaction commits,
so a rollback never invalidates the index for a removal that did not
persist. The set of fired hooks now equals exactly the set of
node_models rows actually deleted, with no interleaving gap.
The status flip in MarkOffline/MarkDraining (setStatus) is a separate,
pre-existing operation and routing already filters non-healthy nodes, so
it stays outside the transaction; return contracts are unchanged.
Deregister was already correct and is untouched. The cheap-path skip
(no hook -> skip the SELECT) is preserved.
Adds a spec asserting MarkOffline fires hooks for exactly the rows it
deletes and leaves no node_models row behind (consistent snapshot).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(nodes): debug logging for prefix-cache routing decisions and observations
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(radixtree): match shared prefixes by valuing every node on insert
Insert recorded the value (node id) only on the final node of the key
chain, leaving every intermediate prefix node valueless. LongestMatch
returns the deepest node that hasValue, so two chains that share a
leading block but diverge in the tail never matched: only exact-repeat
queries hit. That broke the prefix-cache routing core use cases (shared
system prompt, multi-turn extension, volatile tail), all of which rely
on prefix matching rather than exact-repeat.
Set value/hasValue/lastSeen at every node along the chain so each
prefix-block node remembers the node id that served that prefix
(SGLang/vLLM-style). The deepest match wins, and the last writer owns a
shared prefix node (a recency heuristic: the most recent chain through a
block is the one most likely still warm). size now counts valued nodes,
which is the intended meaning.
Updated radixtree tests to the new semantics: deepest-prefix test uses
non-overlapping chains, a new test asserts last-writer-owns-shared-node,
Evict/Remove/MaxEntries expectations recomputed for per-prefix-node
counting, and a shared-prefix LongestMatch red test added. Added a
prefixcache Decide test proving a prefix-only query routes to the warm
node. No prefixcache .go logic changed.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(distributed): lock in prefix-cache routing behavior end to end
Add a DB-backed e2e spec that drives SmartRouter against a real
NodeRegistry (Postgres testcontainer) and the real prefixcache.Index
radix-tree provider, using a fake gRPC backend factory so no real
inference runs. Covers the five behaviors validated by hand:
1. Cold miss + observe: an unseen prefix chain cold-places and is recorded.
2. Hot-match affinity: the same chain returns to its warm node X.
3. Shared-prefix match: a divergent chain sharing X's leading prefix
still routes to X (the radix-tree regression we fixed).
4. Negative control: an unrelated chain is a cold miss, not a false
hot match on X.
5. Failover + invalidation: removing X's replica fires the registry
chokepoint hook to invalidate the prefix entry, and the chain fails
over to surviving node Y and re-homes there.
Replaces the need for manual docker-compose re-runs.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(prefixcache): make prefix-cache affinity replica-granular
Track prefix-cache affinity per loaded replica (a backend process with its
own KV cache) instead of per node, so multiple replicas of the same model on
one node each keep distinct affinity and a hot prefix routes back to the exact
replica that served it.
- radixtree: add RemoveFunc(pred) and reimplement Remove on top of it.
- prefixcache: introduce ReplicaKey{NodeID, Replica}; Index/Candidate/
PrefixDecision/Select/Provider now key on ReplicaKey. Add InvalidateNode to
drop every replica of a node; Invalidate drops one replica. Select returns
(ReplicaKey, bool) and gains a deterministic least-in-flight eligible
fallback (tiebreak NodeID then Replica).
- messaging: carry Replica on PrefixCacheObserveEvent and
PrefixCacheInvalidateEvent (Replica < 0 means all replicas of the node).
- Sync delegates + broadcasts with replica; InvalidateNode broadcasts
Replica=-1; ApplyInvalidate routes negative replica to InvalidateNode.
This is part 1 of 2; the registry/router/wiring consumers are updated
separately.
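The replica-versus-node invalidation routing can be sketched as below. The Index layout and the event struct are illustrative assumptions; only ReplicaKey{NodeID, Replica}, Invalidate, InvalidateNode, ApplyInvalidate, and the "Replica < 0 means all replicas" convention come from this message.

```go
package main

import "fmt"

// ReplicaKey identifies one loaded replica on a node.
type ReplicaKey struct {
	NodeID  string
	Replica int
}

// InvalidateEvent is a hypothetical wire event; Replica < 0 means every
// replica of the node.
type InvalidateEvent struct {
	NodeID  string
	Replica int
}

// Index is a stand-in that only tracks which replicas hold warm entries.
type Index struct{ warm map[ReplicaKey]bool }

// Invalidate drops a single replica.
func (ix *Index) Invalidate(k ReplicaKey) { delete(ix.warm, k) }

// InvalidateNode drops every replica of a node.
func (ix *Index) InvalidateNode(nodeID string) {
	for k := range ix.warm {
		if k.NodeID == nodeID {
			delete(ix.warm, k) // deleting during range is safe in Go
		}
	}
}

// ApplyInvalidate routes a peer's event to the matching local operation.
func (ix *Index) ApplyInvalidate(ev InvalidateEvent) {
	if ev.Replica < 0 {
		ix.InvalidateNode(ev.NodeID)
		return
	}
	ix.Invalidate(ReplicaKey{NodeID: ev.NodeID, Replica: ev.Replica})
}

func main() {
	ix := &Index{warm: map[ReplicaKey]bool{
		{NodeID: "a", Replica: 0}: true,
		{NodeID: "a", Replica: 1}: true,
		{NodeID: "b", Replica: 0}: true,
	}}
	ix.ApplyInvalidate(InvalidateEvent{NodeID: "a", Replica: 1})
	fmt.Println(len(ix.warm))
}
```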
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): make prefix-cache routing replica-granular
Wire the SmartRouter, NodeRegistry, and distributed startup to the
replica-keyed prefixcache API. Affinity is now tracked per replica
(each replica is a separate process with its own KV cache), so a prefix
served by (node,0) no longer leaks onto the same-node sibling (node,1).
- RoutePreference gains PreferredReplica; FindAndLockNodeWithModel locks
the EXACT (node_id, replica_index) row, falling through to the default
ORDER BY when that replica is not loaded.
- SetReplicaRemovedHook now carries replicaIndex; RemoveNodeModel fires
the specific replica, RemoveAllNodeModelReplicas and the four bulk
node-scoped deletes fire replica<0 (all replicas of the node).
- buildPreference builds one Candidate per loaded replica and locks the
exact replica the policy chose; observePrefix records the served
ReplicaKey at every call site.
- distributed.go routes the hook to InvalidateNode (replica<0) or
Invalidate(key).
- Tests updated to the replica-keyed API plus new coverage: a hot prefix
on (node,0) prefers replica 0 over the same-node sibling (router unit +
e2e), and FindAndLock locks the exact preferred replica.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(distributed): derive prefix chain from messages for tokenizer-template models
Prefix-cache-aware routing built its prompt-prefix chain from the rendered
prompt string `s` in ModelInference. For models with
TemplateConfig.UseTokenizerTemplate the frontend never renders a prompt - the
backend tokenizes the structured messages itself - so `s` is empty, the chain
is empty, and routing silently falls back to round-robin. That covers the bulk
of modern chat models (qwen3, llama3, ...), so the feature effectively never
engaged for them.
Fall back to messagesPrefixSource(messages): a deterministic, prefix-stable
head-first serialization of the conversation (role + content per turn). Two
requests sharing a leading system prompt and early turns share a leading byte
prefix, which ExtractChain maps to a shared chain prefix - landing both on the
same cache-warm replica. The rendered `s` is still preferred when present
(higher fidelity for non-template models).
Found via the multi-replica-per-node e2e: zero "prefix-cache routing decision"
logs despite per-request Route calls, traced to the empty-chain guard.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(distributed): document prefix-cache routing roadmap
Add a routing-and-caching roadmap section to the distributed-mode guide,
linking the epic (#10063) and the follow-up issues (#10064-#10070) surfaced
from a survey of SGLang, vLLM production-stack, Ray Serve, llm-d, AIBrix, and
NVIDIA Dynamo.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* ⬆️ Update antirez/ds4
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix(ds4): link new ds4_distributed.o into grpc-server build
Upstream ds4 e16ead1e split distributed inference into a new translation
unit (ds4_distributed.c/.h). ds4.c and ds4_cpu.o now reference its
ds4_dist_* symbols, so the grpc-server link fails with undefined
references unless that object is built and linked.
Add ds4_distributed.o to both the upstream object build (Makefile) and
the grpc-server link set (CMakeLists.txt) for every GPU mode. It is a
single GPU-agnostic object, so it is built/linked unconditionally.
Verified: the six undefined ds4_dist_session_* references in ds4_cpu.o
are all defined by the newly built ds4_distributed.o (nm cross-check).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(parakeet-cpp): L0 backend scaffold, LoadModel + AudioTranscription (text)
Add a Go gRPC backend that bridges LocalAI to parakeet.cpp via the flat
C-API (parakeet_capi.h), loaded with purego (cgo-less, mirrors the
whisper / vibevoice-cpp backends).
L0 scope:
- main.go: dlopen libparakeet.so (override via PARAKEET_LIBRARY), register
the C-API entry points, start the gRPC server.
- goparakeetcpp.go: Load (parakeet_capi_load), AudioTranscription
(parakeet_capi_transcribe_path, decoder=0 = per-arch default head),
Free, serialized through base.SingleThread since the C engine is a
thread-unsafe singleton. char* returns are bound as uintptr so the
malloc'd buffer is freed via parakeet_capi_free_string after copy.
- AudioTranscriptionStream returns a clear "not implemented in L0" error
(closes the channel so the server doesn't hang), wired in L2.
- Makefile: clone-at-pin + cmake (PARAKEET_VERSION for bump_deps.sh),
with a local-symlink dev shortcut; run.sh / package.sh mirror whisper.
- Test auto-skips without PARAKEET_BACKEND_TEST_MODEL/_WAV fixtures.
Builds clean (CGO_ENABLED=0), gofmt clean, test passes. The single
unsafeptr vet note in goStringFromCPtr is documented and matches the
whisper backend's tolerated pattern.
Word/segment timestamps (L1) and cache-aware streaming (L2) follow.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(parakeet-cpp): L1 word/segment timestamps via transcribe_path_json
AudioTranscription now calls parakeet_capi_transcribe_path_json and shapes
the per-word / per-token timestamps into the TranscriptResult:
- Bind parakeet_capi_transcribe_path_json (purego, char* as uintptr like
the other returns) and register it in main.go + the test loader.
- Parse the JSON document ({"text","words":[{w,start,end,conf}],
"tokens":[{id,t,conf}]}) into typed structs.
- Synthesise a single whole-clip segment (parakeet emits no native segment
boundaries) spanning the first word start to the last word end; token ids
populate Segment.Tokens.
- Attach word-level timings only when timestamp_granularities=["word"],
matching the OpenAI API (segment-level default). secondsToNanos mirrors
the whisper backend's nanosecond convention.
Verified end-to-end against tdt_ctc-110m (f16): both the default and
word-granularity specs pass; builds clean, gofmt clean, vet shows only the
one documented unsafeptr note shared with the whisper backend.
Cache-aware streaming (L2) follows.
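The JSON shaping step might look roughly like this sketch. The Go struct names, the numeric field types (seconds as floats), and the segment struct are assumptions; the JSON keys and the first-word-start to last-word-end span follow the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
)

type word struct {
	W     string  `json:"w"`
	Start float64 `json:"start"`
	End   float64 `json:"end"`
	Conf  float64 `json:"conf"`
}

type token struct {
	ID   int     `json:"id"`
	T    float64 `json:"t"`
	Conf float64 `json:"conf"`
}

type transcript struct {
	Text   string  `json:"text"`
	Words  []word  `json:"words"`
	Tokens []token `json:"tokens"`
}

// segment is a hypothetical stand-in for the result segment type.
type segment struct {
	Start, End int64 // nanoseconds
	Text       string
	Tokens     []int
}

func secondsToNanos(s float64) int64 { return int64(math.Round(s * 1e9)) }

// wholeClipSegment parses the document and synthesises one segment spanning
// the first word start to the last word end.
func wholeClipSegment(doc []byte) (segment, error) {
	var tr transcript
	if err := json.Unmarshal(doc, &tr); err != nil {
		return segment{}, err
	}
	seg := segment{Text: tr.Text}
	if n := len(tr.Words); n > 0 {
		seg.Start = secondsToNanos(tr.Words[0].Start)
		seg.End = secondsToNanos(tr.Words[n-1].End)
	}
	for _, tk := range tr.Tokens {
		seg.Tokens = append(seg.Tokens, tk.ID)
	}
	return seg, nil
}

func main() {
	doc := []byte(`{"text":"hello world",
		"words":[{"w":"hello","start":0.5,"end":0.75,"conf":0.9},
		         {"w":"world","start":1.0,"end":1.25,"conf":0.8}],
		"tokens":[{"id":7,"t":0.5,"conf":0.9},{"id":9,"t":1.0,"conf":0.8}]}`)
	seg, err := wholeClipSegment(doc)
	fmt.Println(seg, err)
}
```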
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(parakeet-cpp): L2 cache-aware streaming with EOU segmentation
Wire AudioTranscriptionStream to the streaming RNN-T C-API:
- Bind parakeet_capi_stream_{begin,feed,finalize,free}; feed takes 16 kHz
mono float PCM ([]float32 via purego) and writes *eou_out on <EOU>/<EOB>.
- Decode opts.Dst to 16 kHz mono PCM (utils.AudioToWav + go-audio, same as
the whisper backend), feed it in 1 s chunks, and emit each newly-finalized
text run as a TranscriptStreamResponse delta.
- <EOU>/<EOB> events close the current segment; a closing FinalResult carries
the full transcript plus the per-utterance segments (with a whole-clip
fallback segment when no EOU fired).
- stream_begin returns 0 for non-streaming models, surfaced as a clear
error instead of an empty stream. Honours context cancellation between
chunks. Frees every malloc'd delta and the session.
Verified end-to-end against realtime_eou_120m-v1 (f16): the streamed
transcript matches the offline 110m reference word-for-word, deltas
reconstruct the final text, and the spec passes alongside the offline
specs. Builds clean, gofmt clean, vet shows only the shared documented
unsafeptr note.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(parakeet-cpp): L3 register backend in build/CI/gallery (whisper parity)
Wire the new Go gRPC parakeet-cpp backend (parakeet.cpp ggml port of NVIDIA
NeMo Parakeet ASR) into LocalAI's build/CI/gallery surfaces, matching the
existing ggml whisper Go backend 1:1.
- .github/backend-matrix.yml: add 11 linux entries + 1 darwin entry mirroring
every whisper build (cpu amd64/arm64, intel sycl f32/f16, vulkan amd64/arm64,
nvidia cuda-12, nvidia cuda-13, nvidia-l4t-arm64, nvidia-l4t-cuda-13-arm64,
rocm hipblas, metal-darwin-arm64), all on ./backend/Dockerfile.golang with
backend: "parakeet-cpp" and -*-parakeet-cpp tag-suffixes.
- scripts/changed-backends.js: explicit inferBackendPath branch resolving
parakeet-cpp to backend/go/parakeet-cpp/ before the generic golang branch.
- .github/workflows/bump_deps.yaml: track the PARAKEET_VERSION pin in
backend/go/parakeet-cpp/Makefile (repo mudler/parakeet.cpp, branch master).
- backend/index.yaml: add &parakeetcpp meta + latest/development image entries
for every matrix tag-suffix.
- Makefile: add backends/parakeet-cpp to .NOTPARALLEL, BACKEND_PARAKEET_CPP
definition, docker-build target eval, and test-extra-backend-parakeet-cpp-
transcription target (mirrors test-extra-backend-whisper-transcription).
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(parakeet-cpp): L4 gallery importer for parakeet GGUFs
Add ParakeetCppImporter so parakeet.cpp GGUFs auto-detect on /import-model
and route to the parakeet-cpp backend (it also surfaces in /backends/known,
which drives the import dropdown).
- Match is narrow: a .gguf whose name carries a parakeet architecture token
(<arch>-<size>-<quant>.gguf, e.g. tdt_ctc-110m-f16.gguf, rnnt-0.6b-q4_k.gguf,
realtime_eou_120m-v1-q8_0.gguf), a direct URL to one, or
preferences.backend="parakeet-cpp". It deliberately does NOT claim arbitrary
llama-style GGUFs, nor the upstream nvidia/parakeet-* NeMo repos (.nemo, not
runnable here).
- Registered in the ASR batch BEFORE LlamaCPPImporter so its GGUFs aren't
swallowed by the generic .gguf importer.
- Import nests files under parakeet-cpp/models/<name>/, defaults to the
smallest quant (q4_k, near-lossless on parakeet) with a size-ladder
fallback, and honours preferences.quantizations / name / description.
Tested with synthetic HF details (no network): metadata, positive matches
(HF repo, direct URL, preference), narrowness negatives (llama GGUF, NeMo
repo), and import (default quant, override, direct URL), 9 specs pass,
build/vet/gofmt clean.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(parakeet-cpp): document the parakeet-cpp transcription backend
Add parakeet-cpp to the audio-to-text backend list and a dedicated usage
section: direct GGUF import (auto-detects to the backend), model YAML,
word-level timestamps via timestamp_granularities[]=word, and cache-aware
streaming with the realtime_eou model. Points at the mudler/parakeet-cpp-gguf
collection repo.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(parakeet-cpp): wire transcription gRPC e2e test into test-extra
The L3 commit added the test-extra-backend-parakeet-cpp-transcription
Makefile target but never invoked it in CI. Mirror the whisper job:
- Add a parakeet-cpp output to detect-changes (emitted by
changed-backends.js from the matrix entry).
- Add tests-parakeet-cpp-grpc-transcription, gated on the parakeet-cpp
path filter / run-all, building the backend image and running the
transcription e2e against tdt_ctc-110m + the JFK clip.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* style(parakeet-cpp): drop em dashes from comments and docs
Replace em dashes with plain punctuation in the backend comments, the
importer, package.sh, and the audio-to-text docs section (and use "and"
instead of the multiplication sign). No behaviour change.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(gallery): add parakeet-cpp f16 models to the model gallery
Add the 10 NVIDIA Parakeet models (f16, the recommended quality/speed
default) as gallery entries that install on the parakeet-cpp backend from
mudler/parakeet-cpp-gguf: tdt_ctc-110m/1.1b, tdt-0.6b-v2/v3, tdt-1.1b,
ctc-0.6b/1.1b, rnnt-0.6b/1.1b, and the cache-aware streaming
realtime_eou_120m-v1. Each pins the file sha256 and routes transcript
usecases to the backend.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(parakeet-cpp): satisfy govet lint + bump PARAKEET_VERSION
- goparakeetcpp.go: //nolint:govet on the C-owned-pointer unsafe.Pointer
conversion (golangci-lint reports new-only issues, so unlike the whisper
backend's identical line this one is flagged).
- Makefile: bump PARAKEET_VERSION to the current parakeet.cpp master commit
(the previous pin's commit no longer exists after upstream history was
squashed), so the backend image clone/build resolves again.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(parakeet-cpp): pin PARAKEET_VERSION to a tag-stable commit
The previous SHA pin was orphaned when parakeet.cpp's single-commit master
was amended/force-pushed, so the backend image clone (git fetch <sha>) failed
across every build variant. Repoint to 845c29e, which upstream now keeps
permanently fetchable via the `localai-backend-pin` tag, so future upstream
amends no longer break the backend build.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(parakeet-cpp): init the ggml submodule in the backend image clone
The backend Dockerfile clones parakeet.cpp at PARAKEET_VERSION with a shallow
fetch + checkout but never initialised submodules, so third_party/ggml was
empty and the parakeet.cpp cmake build failed at
`add_subdirectory(third_party/ggml)` (CMakeLists.txt:53) on every build
variant. Add `git submodule update --init --recursive --depth 1
--single-branch` after checkout, mirroring the whisper backend. Verified
locally: clone + submodule + cmake configure now succeeds.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(parakeet-cpp): statically link ggml into libparakeet.so
The shared libparakeet.so linked ggml's shared libs (libggml*.so), but the
package only ships libparakeet.so, so at runtime dlopen failed with
"libggml.so.0: cannot open shared object file" (the e2e transcription test
panicked on load). Build ggml static + PIC (BUILD_SHARED_LIBS=OFF,
CMAKE_POSITION_INDEPENDENT_CODE=ON) so libparakeet.so embeds ggml and depends
only on system libs already present in the runtime image. Verified locally:
ldd shows no libggml dependency.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(parakeet-cpp): non-streaming fallback in AudioTranscriptionStream
The e2e streaming test ran AudioTranscriptionStream against tdt_ctc-110m
(not a cache-aware streaming model), so stream_begin returned 0 and the call
errored. Per LocalAI's streaming contract (and the whisper backend), a
non-streaming model should fall back to a single offline transcription
emitted as one delta plus a closing FinalResult. Do that instead of erroring,
so the streaming endpoint works for every parakeet model. Verified locally:
the streaming spec passes against the non-streaming 110m model via fallback.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
LocalAI's outbound HTTP clients used Go's default redirect policy, which
follows up to 10 redirects. On a cross-host redirect Go forwards custom
request headers — including credential headers such as Anthropic's
x-api-key — to the redirect target (Go strips Authorization, Cookie and
WWW-Authenticate cross-host, but NOT arbitrary custom headers). An
attacker able to elicit a redirect from an upstream (a hijacked or
spoofed upstream, DNS trickery, or a malicious upstream_url) then
harvests the operator's provider API key.
This was first reported against the cloud-proxy / MITM PII path
(GHSA-3mj3-57v2-4636); the same class affects every other outbound
client. Rather than patch each call site, add pkg/httpclient as the one
sanctioned constructor for outbound HTTP and route everything through it.
pkg/httpclient:
- New(...): refuses redirects, TLS 1.2 floor, no body deadline
  (streaming/SSE safe)
- NewWithTimeout(d): simple request/response calls
- WithFollowRedirects: opt-in following that still strips credential
  headers on any cross-host hop; different scheme/host/port == different
  origin, guarding the curl CVE-2022-27774 port-confusion class
- WithTransport(rt): keep a custom transport (IP-pin, HTTP/2, a
  credential-injecting RoundTripper)
- HardenedTransport(): base transport with the TLS floor + bounded setup
- Harden(c): apply the policy to a library-supplied *http.Client
- NoRedirect: the CheckRedirect policy; wraps ErrRedirectBlocked
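The cross-origin rule described for WithFollowRedirects can be sketched as follows. This is a minimal Python illustration, not the Go code in pkg/httpclient: the function names and the CREDENTIAL_HEADERS set are assumptions made for the example.

```python
from urllib.parse import urlsplit

# Hypothetical header set for illustration; the real list lives in pkg/httpclient.
CREDENTIAL_HEADERS = {"authorization", "cookie", "x-api-key"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> tuple:
    """Return (scheme, host, port), filling in the scheme's default port."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, (parts.hostname or "").lower(), port)


def headers_for_redirect(headers: dict, from_url: str, to_url: str) -> dict:
    """Drop credential headers when a redirect leaves the original origin."""
    if origin(from_url) == origin(to_url):
        return dict(headers)
    return {k: v for k, v in headers.items() if k.lower() not in CREDENTIAL_HEADERS}
```

Comparing the port as well as the host is what covers the port-confusion case: the same hostname on a different port counts as a different origin and loses the credentials.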
Lint: a forbidigo rule flags http.DefaultClient and http.Get/Post/
PostForm/Head, pointing at pkg/httpclient (.golangci.yml,
.agents/coding-style.md). forbidigo cannot match the &http.Client{}
composite literal without also flagging legitimate *http.Client type
references, so that form is enforced by review.
Migrates every non-test outbound call site across core/, pkg/, cmd/, and
the Go backend (backend/go/cloud-proxy). Credential-bearing and
internal-RPC clients refuse redirects; download / CDN / registry clients
use WithFollowRedirects so they keep working while stripping secrets
cross-host. The only credential-bearing client that follows redirects is
the gated-download path (pkg/downloader/uri.go), which strips the token
on the cross-host hop to the CDN. Hardening this closes, in passing:
- MCP remote-server bearer token leaking via a redirect (the
RoundTripper re-injected Authorization on every hop)
- agent multimedia/webhook clients leaking user-supplied auth headers
- cors_proxy following redirects, bypassing its SSRF IP-pin
- downloader's authorized read path leaking the token cross-host
Fixes: GHSA-3mj3-57v2-4636 (cloud-proxy leaks operator provider API key
(x-api-key) to attacker host on cross-host redirect)
Reported-by: tonghuaroot
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The OpenAI `reasoning_effort` field only reached the prompt template; it
never toggled the backend's thinking. Map it onto
ReasoningConfig.DisableReasoning (which becomes the enable_thinking gRPC
metadata) in the request merge, so reasoning_effort="none" disables
reasoning per request: the use case from #10072 (run a single Qwen3-style
model and turn reasoning off for low-latency tasks while keeping it on
for others).
Effort levels (minimal/low/medium/high) enable thinking unless the model
config explicitly disabled it (reasoning.disable: true wins and is never
re-enabled by a request); "none" always disables.
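The precedence rules above can be sketched as a small decision function. This is a Python illustration only; the function name is hypothetical, and the last branch (no effort sent leaves the config value untouched) is an assumption, not something this message states.

```python
from typing import Optional

EFFORT_LEVELS = {"minimal", "low", "medium", "high"}


def resolve_disable_reasoning(config_disable: Optional[bool], effort: str) -> Optional[bool]:
    """Return the effective DisableReasoning value for one request."""
    if effort == "none":
        return True            # "none" always disables
    if effort in EFFORT_LEVELS:
        if config_disable is True:
            return True        # reasoning.disable: true wins over the request
        return False           # an effort level enables thinking
    return config_disable      # assumption: no effort sent keeps the config value
```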
Closes #10072
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Two separate issues made graceful backend shutdown look ungraceful in the
logs, even though the processes were being terminated correctly
(go-processmanager defaults to process-group SIGTERM + 15s grace + SIGKILL):
1. "failed to read PID" — startProcess registers a per-process graceful-
termination handler that calls Stop(), but StopAllGRPC (registered
earlier, via app.Shutdown) already stopped and released store-tracked
backends first. The second Stop() then failed reading the removed
pidfile. Guard the handler with IsAlive() so it skips already-stopped
processes; it still covers backends StopAllGRPC doesn't track (worker-
supervised ones).
2. "Backend process exited unexpectedly" exitCode=-1 — the exit watcher
treated only exit codes 0/143 as clean. But a child killed by our own
SIGTERM/SIGKILL is reported by Go as exitCode -1 (signal termination),
not the shell's 128+signal convention, so every intentional stop logged
a false crash warning. The exit code can't distinguish an intended stop
from a signal-induced crash.
Track intent directly instead: a stoppingProcs sync.Map (keyed by the
*process.Process pointer) is marked wherever LocalAI calls Stop() on
purpose, and the exit watcher uses it to pick the log level — Info
"stopped" when intentional, Warn "exited unexpectedly" otherwise (still
catching real crashes). The raw exit code is reported as a field but no
longer interpreted.
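A rough Python sketch of the intent-tracking idea, assuming a set guarded by a lock in place of the sync.Map; the class and method names are hypothetical, and clearing the mark on exit is an assumption of this sketch.

```python
import threading


class ExitWatcher:
    """Pick a log level from recorded intent instead of the exit code."""

    def __init__(self):
        self._lock = threading.Lock()
        self._stopping = set()  # ids of processes being stopped on purpose

    def mark_stopping(self, proc_id):
        # Called wherever Stop() is invoked intentionally.
        with self._lock:
            self._stopping.add(proc_id)

    def on_exit(self, proc_id, exit_code):
        with self._lock:
            intentional = proc_id in self._stopping
            self._stopping.discard(proc_id)
        # The exit code is reported as a field but never interpreted.
        if intentional:
            return ("info", "stopped", exit_code)
        return ("warn", "exited unexpectedly", exit_code)
```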
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
pkg/utils/path.go provides the security primitives for download paths
(VerifyPath, InTrustedRoot) and the file-naming helpers used by every
import flow (SanitizeFileName, GenerateUniqueFileName). None of them had
test coverage, so a future regression in the traversal check or in the
".." stripping inside SanitizeFileName would land unnoticed.
The new specs pin the lexical contract for each helper:
- VerifyPath accepts strict descendants and inner traversal that stays
inside the base, rejects "..", compound traversal, and the base path
itself. An explicit spec documents that the check is purely lexical
(filepath.Clean, not EvalSymlinks) so any future caller that needs
symlink-aware defence knows to EvalSymlinks first.
- InTrustedRoot rejects the trusted root and sibling directories,
accepts deeply nested descendants.
- SanitizeFileName covers the leading-directory and absolute-prefix
paths plus the embedded ".." case ("foo..bar" -> "foobar") that the
Clean+Base layer alone would leave intact.
- GenerateUniqueFileName covers the no-collision, single-collision,
walk-the-counter, and empty-extension cases using GinkgoT().TempDir()
so the suite stays hermetic.
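For readers unfamiliar with the lexical contract, here is a Python approximation of two of the helpers. It is a sketch using posixpath, not the Go implementation; absolute inputs are handled differently by Go's filepath.Join and are not modelled here.

```python
import posixpath


def verify_path(path: str, base: str) -> bool:
    """Lexical containment check: True only for strict descendants of base."""
    base_clean = posixpath.normpath(base)
    full = posixpath.normpath(posixpath.join(base_clean, path))
    # No symlink resolution happens here, mirroring the purely lexical check.
    return full.startswith(base_clean.rstrip("/") + "/")


def sanitize_file_name(name: str) -> str:
    """Keep only the final path element, then strip any embedded '..'."""
    base_name = posixpath.basename(posixpath.normpath(name))
    return base_name.replace("..", "")
```

Inner traversal such as "a/../b.gguf" normalises to a path inside the base and is accepted, while "." normalises to the base itself and is rejected.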
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: TLoE419 <tloemizuchizu@gmail.com>
Exercise the filtered empty-state path in the models gallery and verify
that the clear-filters action restores the list and resets the filter
selection.
Assisted-by: Codex:gpt-5
Signed-off-by: Ching Kao <0980124jim@gmail.com>
fix(functions): validate auto-detected XML tool-call names (#9722)
The XML tool-call auto-detector tries every preset, including glm-4.5 whose
tool block is <tool_call>name...</tool_call>. When a Hermes/NousResearch model
emits <tool_call>{"name":"bash","arguments":{...}}</tool_call>, glm-4.5
mis-claims the block and returns the entire JSON object (or leading prose, or a
JSON array) as the function NAME. The misparse then wins over the JSON parser,
so streaming clients receive a tool call whose name is a JSON blob.
Guard the auto-detect paths in ParseXMLIterative: a returned tool name must look
like a real function name ([A-Za-z0-9_.-]+). Results that don't are dropped so
auto-detection falls through to the next format and ultimately to JSON parsing,
which handles Hermes correctly. An explicitly forced format (format != nil) is
left untouched and trusted verbatim.
This supersedes PR #9940, which dropped only names with a leading "{". That
narrower check misses leading prose ("Sure: {...}"), JSON arrays ("[{...}]")
and brace-less garbage ("name: bash, ..."); the name-shape check rejects all of
them while still accepting legitimate glm-4.5 calls. The fix applies to both the
streaming worker and the non-streaming ParseFunctionCall path, which both call
ParseXMLIterative with auto-detection.
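The name-shape check is small enough to show in full. A Python sketch of the same pattern follows; the function name is hypothetical and the real guard lives in Go inside ParseXMLIterative.

```python
import re

# Same character class as the guard described above.
_TOOL_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def looks_like_tool_name(name: str) -> bool:
    """True when the whole string is a plausible function name."""
    return _TOOL_NAME.fullmatch(name) is not None
```

Because the whole string must match, JSON blobs, leading prose, and arrays are all rejected by the same test, with no per-shape special cases.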
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
application.New wires a fire-and-forget goroutine that runs
StopAllGRPC + distributed.Shutdown when the app context is cancelled.
Callers (tests, CLI signal handler) cancel the context and then exit
immediately, so the test binary / process can terminate before that
goroutine kills the spawned backend children. go-processmanager sets no
Pdeathsig, so the orphans are reparented to init and survive — leaving
dozens of stray mock-backend processes after an e2e run.
Add Application.Shutdown(), which runs the same cleanup synchronously on
the caller's stack and is idempotent via sync.Once. The context-cancel
goroutine, the CLI signal handler, and the test suites all call it, so
cleanup is deterministic and the duplicated teardown logic collapses to
one place. The async goroutine remains as a safety net for callers that
forget; sync.Once dedupes the double call.
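The idempotent-shutdown pattern can be sketched in Python with a lock and a flag standing in for sync.Once. The class shape and the cleanup callback are assumptions for illustration.

```python
import threading


class Application:
    def __init__(self, cleanup):
        self._cleanup = cleanup        # e.g. stop backends, then distributed shutdown
        self._lock = threading.Lock()
        self._done = False

    def shutdown(self):
        """Run cleanup synchronously, at most once, on the caller's stack."""
        with self._lock:
            if self._done:
                return
            self._done = True
            # Cleanup runs under the lock, so a concurrent second caller
            # blocks until the first one has finished.
            self._cleanup()
```

A signal handler, a context-cancel watcher, and a test teardown can all call shutdown() without coordinating; only the first call does the work.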
Wire e2e_suite_test and the two mock-backend Contexts in app_test to
call Shutdown in their AfterSuite/AfterEach.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Streaming /v1/chat/completions could emit the same logical tool call at
multiple `index` values. In processStreamWithTools the Go-side iterative
parser (ParseXMLIterative / ParseJSONIterative) runs on every token and
emits tool-call deltas, while the C++ chat-template autoparser delivers
its own tool calls via ChatDeltas that are flushed at end-of-stream by
ToolCallsFromChatDeltas -> buildDeferredToolCallChunks. With both paths
active the same call is emitted twice at different indices, so OpenAI
clients that accumulate tool calls by `index` dispatch the tool N times.
Skip the Go-side iterative parser once the autoparser is producing tool
calls (hasChatDeltaToolCalls). The deferred flush stays guarded by
lastEmittedCount, so the race where the Go parser emitted before the flag
flipped also remains single-emission. Backends without an autoparser
(e.g. vLLM) keep hasChatDeltaToolCalls=false and are unaffected.
Refs #9722
Signed-off-by: bozhouDev <259759010+bozhouDev@users.noreply.github.com>
Co-authored-by: bozhouDev <259759010+bozhouDev@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(grammars): honor properties_order entry at index 0
The JSON-schema-to-GBNF property sort used `aOrder != 0 && bOrder != 0` as
its "is this key ordered?" guard. That treats index 0 — the first key listed
in properties_order — as unset, so `properties_order: name,arguments` fell
back to alphabetical ordering and still emitted "arguments" before "name".
Use presence in the order map instead: listed keys sort by their index and
ahead of unlisted keys, which keep a stable alphabetical order. This makes
the documented `properties_order: name,arguments` actually produce
name-first tool-call JSON. Relates to #10052.
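To make the index-0 problem concrete, here is a Python reconstruction of the old guard next to a presence-based sort. Both functions are illustrative sketches with hypothetical names, not the Go code.

```python
from functools import cmp_to_key


def buggy_cmp(a, b, order):
    # A missing key and the key at index 0 both look like 0 here.
    a_order, b_order = order.get(a, 0), order.get(b, 0)
    if a_order != 0 and b_order != 0:
        return a_order - b_order
    return (a > b) - (a < b)  # falls back to alphabetical


def order_properties(keys, properties_order):
    """Listed keys sort by index and ahead of unlisted keys (alphabetical)."""
    order = {name: i for i, name in enumerate(properties_order)}

    def sort_key(k):
        if k in order:            # presence in the map, not a non-zero value
            return (0, order[k], k)
        return (1, 0, k)

    return sorted(keys, key=sort_key)
```

With properties_order set to name,arguments, the old comparison sees index 0 for "name", treats it as unset, and sorts alphabetically; the presence check keeps "name" first.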
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(functions): defer tool grammar to the backend when the tokenizer template owns templating (#10052)
When use_tokenizer_template delegates templating to the backend (llama.cpp),
the backend also owns tool-call grammar generation and parsing. LocalAI was
still generating its own GBNF grammar and sending it down. With a grammar
present, llama.cpp does not hand the tools to its template, so its native
peg/json tool parser never engages: it streams the grammar-constrained
tool-call JSON back as plain content instead of emitting tool_calls. In
streaming mode the JSON object leaked into the content field, and the
Go-side incremental detector never gated content because the
LocalAI-generated grammar emitted "arguments" before "name".
The GGUF auto-import path already couples use_tokenizer_template with
grammar.disable, but that block is skipped when a template is already
configured, so gallery and hand-written configs (e.g. qwen3) that set the
tokenizer template directly never got the paired grammar.disable.
- SetDefaults now enforces the coupling for every config: when
use_tokenizer_template is set, grammar generation is disabled and tools
flow to the backend's native (name-first) pipeline. This also fixes
already-installed models without editing each config.
- Set function.grammar.disable in the shared gallery/qwen3.yaml, which is
the base config referenced by every qwen3 gallery entry.
Verified end to end against qwen3-4b with stream:true + tools: content no
longer carries the tool-call JSON, reasoning is classified separately, and
tool calls stream as proper name-first tool_calls deltas.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(turboquant): guard upstream-only grpc-server fields for fork build
backend/cpp/llama-cpp/grpc-server.cpp is reused by the turboquant build,
which compiles against an older llama.cpp fork (TheTom/llama-cpp-turboquant).
Two recent changes added references to upstream-only struct fields outside the
existing LOCALAI_LEGACY_LLAMA_CPP_SPEC guards:
- common_params::checkpoint_min_step (default + option handler), added with
the ggml-org/llama.cpp 35c9b1f3 bump (#9998)
- the common_params_speculative::draft tensor_buft_overrides sentinel
termination (#9919), which sat after the guard's #endif
The fork has neither field, so grpc-server.cpp failed to compile for every
turboquant flavor. Wrap the three references in #ifndef
LOCALAI_LEGACY_LLAMA_CPP_SPEC, matching the existing fork-compat guards, so the
stock llama-cpp build is unchanged and the fork build skips them. Update
patch-grpc-server.sh's doc comment to record what the macro now gates out.
Verified by a local fallback-flavor turboquant build: grpc-server.cpp compiles
against the fork and the backend image builds.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* Curate the highlight.js build to ~29 languages (lib/core + the
common set) instead of the full ~190-grammar default: -787 KB raw /
-230 KB gz on the base bundle.
* Code-split every route via React.lazy with a per-layout <Suspense>
in App.jsx so the sidebar stays mounted on navigation. Initial entry
chunk drops from 3194 KB raw / 887 KB gz to 397 KB / 122 KB (-87%).
Warm chunks on sidebar hover/focus/touch via a preload registry so
the click finds the chunk already in flight or cached.
* Migrate Playwright coverage from istanbul (build-time counters) to
native Chromium V8 coverage, with per-worker accumulation +
conversion. Suite drops from 71s to 30s at 20 workers (~58%) at the
non-instrumented floor.
* Keep the coverage gate bundling-invariant: the coverage build inlines
dynamic imports so every shipped source file lands in the denominator
(otherwise untested page chunks silently drop out and inflate the
percentage). Production builds stay code-split.
* Add UI_TEST_WORKERS=N Makefile knob; tighten coverage tolerance to
0.8pp now that jitter sits near istanbul's ~0.5pp again.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(openresponses): populate Content and accept bare {role,content} items (#10039)
Fixes mudler/LocalAI#10039: `/v1/responses` silently returned empty
output on any model whose YAML doesn't include a Go-side
`template.chat_message` block.
Three cooperating bugs:
* `convertORInputToMessages` populated only `StringContent` for string
input and for the `input.Instructions` system message, leaving the
`Content` (any) field nil.
* `TemplateMessages` gated all fallback content-rendering branches on
`Content != nil && StringContent != ""` — but every branch in that
function consumes `StringContent`, not `Content`. The `&&` silently
dropped messages that had StringContent set and Content nil, producing
an empty prompt that the 5× empty-retry guard then turned into a
200 OK with `output: []`.
* The array-input branch of `convertORInputToMessages` dispatched on
`itemMap["type"]` with no default, dropping bare `{role, content}`
items emitted by the OpenAI Python SDK helper
`client.responses.create(input=[{...}])`.
Fix:
* Set both `Content` and `StringContent` in the two openresponses
message-construction sites that only set one.
* Treat a bare `{role, content}` item (no `type`) as
`type: "message"` for OpenAI-SDK compatibility.
* Gate `TemplateMessages` fallback rendering on `StringContent != ""`,
which is what every downstream branch in that function actually
reads.
Regression test added to `evaluator_test.go` covering the fallback
path (no `ChatMessage` template) with a StringContent-only message,
both with and without a role mapping.
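The three fixes can be pictured with a simplified Python sketch. The dict keys "content" and "string_content" stand in for the Content and StringContent fields, only plain string content is modelled, and the function names are hypothetical.

```python
def to_message(item: dict):
    """Convert one input item; a bare {role, content} item counts as a message."""
    item_type = item.get("type", "message")  # missing type: treat as "message"
    if item_type != "message":
        return None  # other item types are out of scope for this sketch
    text = item.get("content", "")
    if not isinstance(text, str):
        text = ""  # simplification: content parts are not modelled here
    # Populate both fields so either consumer finds the text.
    return {"role": item.get("role"), "content": text, "string_content": text}


def should_render(msg: dict) -> bool:
    """Gate fallback rendering on the field the renderer actually reads."""
    return msg["string_content"] != ""
```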
* test(openresponses): guard Content population and ToProto path (#10039)
Add regression tests for the two seams the original fix touched but
left uncovered:
* convertORInputToMessages must populate both Content and StringContent
for plain string input and for bare {role, content} array items (the
OpenAI SDK shape that omits the type discriminator). Both are
functional reds against the pre-fix code.
* Messages.ToProto reads Content, not StringContent — this is the path
UseTokenizerTemplate backends (imported GGUFs) take. The cases pin
that contract so a future regression on the producer side is caught.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(react-ui): force .check() on hidden Toggle input in fits-filter e2e
The polish PR (#10030) swapped the raw <input type=checkbox> for the
shared <Toggle> component, which visually hides its native input via
opacity:0;width:0;height:0. Playwright's .check() waits for visibility
before clicking and times out after 30 s, breaking two UI E2E tests:
- enabling fits filter hides models that exceed available VRAM
- fits filter state persists after reload
Pass { force: true } to skip the visibility check; the input is still
the real focusable checkbox and toggles state on click. The companion
.toBeChecked() assertion only reads state and works unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7
* fix(react-ui): click visible Toggle track in fits-filter e2e
force:true skips the actionability checks but not the viewport check,
and the Toggle's hidden input has width:0;height:0 so Playwright still
reports "Element is outside of the viewport". Click the visible
.toggle__track inside the filter-bar-group__toggle wrapper instead —
that's what a real user clicks, and label-input association toggles
the wrapped checkbox naturally.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(react-ui): polish 'Fits in my GPU' filter to use design-system Toggle
The recently added VRAM-fit filter in the Models page used a raw
<input type="checkbox"> next to the themed range slider, breaking the
visual language of the rest of the row. Swap it for the shared
<Toggle> component (already used by Backends, Settings, Traces,
AgentCreate), adopt the filter-bar-group__toggle class to drop the
duplicated inline styles, add a fa-microchip icon to mirror the
per-row fit indicator, and add a subtle left divider so the filter
reads as separate from the context-size slider on its left.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(react-ui): move 'Fits in GPU' filter to filter row and unify copy
Two follow-ups on the previous polish pass:
1. Move the toggle from the context-slider row into the filter-button
row above. The toggle is a filter on the result set, not a config
for VRAM estimation, so it belongs with the type chips and backend
select. The context slider stays its own thing.
2. Unify the label copy. The same locale file had "Fits in my GPU"
for the filter and "Fits in GPU" for the per-row indicator; pick
the shorter, possessive-free variant everywhere (en/de/es/it/zh-CN).
Update e2e selectors to match.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Adds a Go native gRPC backend that dlopens librfdetrcpp.so (built from
mudler/rf-detr.cpp at the pinned RFDETR_VERSION) via purego and exposes
the rfdetr.cpp inference pipeline through LocalAI's existing Detect RPC.
Supports all 5 RF-DETR detection variants (Nano/Small/Base/Medium/Large)
and 6 segmentation variants (SegNano/SegSmall/SegMedium/SegLarge/
SegXLarge/Seg2XLarge) with F32/F16/Q8_0/Q4_K quantizations. Pre-built
GGUFs ship at mudler/rfdetr-cpp-* on HuggingFace.
Detection returns Bbox + class_name + confidence; segmentation also
returns PNG-encoded per-detection masks via the rfdetr_capi accessor
functions (rfdetr_capi_get_detection_{class_id,box,score,class_name,
mask_png}).
End-to-end verified through POST /v1/detection: HTTP -> gRPC -> purego
dlopen -> rfdetr.cpp -> ggml -> response (9 detections on the detection
model, 21 detections + valid PNG masks on the seg-nano model against
the kitchen fixture).
Wiring:
- backend/go/rfdetr-cpp/{main.go,gorfdetrcpp.go,CMakeLists.txt,
Makefile,run.sh,package.sh,test.sh,.gitignore}
- Top-level Makefile: BACKEND_RFDETR_CPP, docker-build target,
.NOTPARALLEL, prepare-test-extra, test-extra
- backend/go/rfdetr-cpp/Makefile: `test` target invoked by test-extra
- .github/backend-matrix.yml: CPU + CUDA-12/13 + L4T CUDA-12/13
(arm64) + HIP + Vulkan (amd64 + arm64) + SYCL f32/f16
- backend/index.yaml: rfdetr-cpp meta anchor + latest/development
image entries for every matrix tag-suffix
- .github/workflows/bump_deps.yaml: RFDETR_VERSION pin tracking
(mudler/rf-detr.cpp branch main)
- gallery/index.yaml: 11 rfdetr-cpp-* entries (nano + 4 detection
variants + 6 seg variants), all backed by mudler/rfdetr-cpp-*
on HuggingFace with sha256 pinning on the F16 default
- core/gallery/importers/rfdetr.go: GGUF auto-routing for HF imports
(mudler/rfdetr-cpp-* repos route to rfdetr-cpp, Transformer-format
repos stay on the Python rfdetr backend; explicit preferences.backend
overrides both heuristics)
- core/gallery/importers/rfdetr_test.go: table-driven coverage of the
auto-routing + a live mudler/rfdetr-cpp-nano cross-check
scripts/changed-backends.js needs no change: the existing
Dockerfile.golang -> backend/go/${item.backend}/ branch already routes
the 9 rfdetr-cpp matrix entries to the correct backend path.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
useOperations() spun up its own setInterval per hook instance, so on
pages like /app/models the OperationsBar in App.jsx plus the page's
own useOperations() call each polled /api/operations at 1 Hz: 2 RPS
sustained for the whole session, repeated on Backends and Chat.
Lift the poller into an OperationsProvider mounted under AuthProvider
so all consumers (OperationsBar, Models, Backends, Chat) share one
timer. The hook file re-exports from the context to keep call sites
unchanged.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(nemo): extract Hypothesis.text for TDT/RNNT ASR models
CTC models (e.g. Whisper) return List[str] from transcribe(), but
TDT/RNNT models (e.g. parakeet-tdt-0.6b-v3) return List[Hypothesis]
where the decoded text lives in the Hypothesis.text attribute.
Previously, results[0] was assigned directly to the protobuf string
field, causing silent empty output for non-CTC models.
Now checks the return type and extracts .text from Hypothesis objects,
with a safe fallback via getattr().
* refactor: simplify Hypothesis text extraction per Copilot review
Use single getattr() call instead of hasattr() + double access,
and return empty string for unknown types instead of str(result)
to avoid leaking internal repr to clients.
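A minimal sketch of the simplified extraction, assuming a helper named extract_text (the name is hypothetical; the getattr approach is the one described above):

```python
def extract_text(result) -> str:
    """Return the transcript from a str or a Hypothesis-like object."""
    if isinstance(result, str):
        return result
    # Single getattr with a default instead of hasattr() plus a second access.
    text = getattr(result, "text", None)
    # Unknown types yield "" so no internal repr reaches the client.
    return text if isinstance(text, str) else ""
```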
* fix(qwen-asr): enable timestamp output when forced_aligner is configured
Two bugs prevented timestamps from working in the qwen-asr backend:
1. transcribe() was called without return_time_stamps=True, so the
forced aligner was loaded but never invoked. Now we pass
return_time_stamps=True when a forced_aligner is present.
2. The timestamp parsing code expected (list, tuple) items, but the
qwen_asr library returns ForcedAlignItem dataclass instances with
.text, .start_time, .end_time attributes. Added hasattr() check
to handle this correctly, falling back to tuple parsing for
backward compatibility.
* refactor: address Copilot review for qwen-asr timestamps
- Wrap return_time_stamps kwarg in try/except TypeError for safety
- Add defensive float() normalization for timestamp times
- Use str() for text extraction to ensure string type
* fix(qwen-asr): convert seconds to nanoseconds for Go time.Duration
The Go server reads TranscriptSegment.start/end via time.Duration,
which is in nanoseconds. Previously the backend sent milliseconds
(* 1000), causing timestamps to be 1e6x too small (e.g. 8e-8
instead of 0.08). Convert seconds → nanoseconds (* 1e9) instead.
Also applies to the legacy tuple path for consistency.
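The unit conversion, as a small sketch; the helper name is hypothetical and the rounding to the nearest nanosecond is a choice made for this example.

```python
NANOS_PER_SECOND = 1_000_000_000


def seconds_to_nanos(seconds: float) -> int:
    """Convert seconds to the integer nanoseconds that time.Duration counts."""
    return int(round(float(seconds) * NANOS_PER_SECOND))
```

With this, 0.08 s becomes 80,000,000 ns, which the Go side reads back as 80ms.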
* feat(qwen-asr): respect timestamp_granularities (segment vs word)
Read request.timestamp_granularities from the gRPC request.
- 'word': return one segment per aligned item (character / word)
- 'segment' (default): merge consecutive items at sentence boundaries
Sentence boundaries detected via CJK punctuation (。!?;…)
and Latin endings (. ! ? ;). This matches the OpenAI Whisper API
contract where omitting the parameter defaults to segment-level.
* fix(qwen-asr): escape smart quotes in punctuation set
Unicode curly quotes (U+2018/2019) were being interpreted as Python
string delimiters, causing SyntaxError. Use explicit unicode escapes.
* fix(qwen-asr): use time-gap threshold for segment boundaries
The forced aligner strips punctuation from its output, so text-based
sentence detection doesn't work. Instead, detect segment boundaries
by measuring time gaps between consecutive aligned items.
Threshold = max(median_gap * 4, 0.3s). This cleanly separates
intra-sentence gaps (< 0.24s) from inter-sentence gaps (> 0.3s)
across Chinese, English, and other languages.
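A sketch of the gap-based split under stated assumptions: each aligned item is a (text, start, end) tuple in seconds, and a gap is measured from one item's end to the next item's start. Only the threshold formula comes from this message; the rest is illustrative.

```python
from statistics import median


def split_by_gap(items, min_threshold=0.3, factor=4.0):
    """Group (text, start_s, end_s) items, splitting where the pause is long."""
    if not items:
        return []
    gaps = [max(0.0, nxt[1] - cur[2]) for cur, nxt in zip(items, items[1:])]
    # Threshold = max(median_gap * 4, 0.3s)
    threshold = max(median(gaps) * factor, min_threshold) if gaps else min_threshold
    groups = [[items[0]]]
    for gap, item in zip(gaps, items[1:]):
        if gap > threshold:
            groups.append([item])    # long pause: start a new segment
        else:
            groups[-1].append(item)  # short pause: same segment
    return groups
```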
* fix(qwen-asr): smart join with spaces for non-CJK tokens
The forced aligner strips whitespace from tokenized text, so English
words like ['hello', 'world'] were joined as 'helloworld'. Add
_smart_join() that inserts spaces between non-CJK tokens while
keeping CJK characters and punctuation unspaced. Works for Chinese,
English, Korean, Japanese, and mixed-language text.
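One way such a join could work, as a rough sketch rather than the backend's _smart_join: the code-point ranges used to detect CJK and the rule "add a space only between two non-CJK alphanumeric characters" are assumptions of this example. Hangul falls outside these ranges, so Korean tokens are space-separated here.

```python
def _is_cjk(ch: str) -> bool:
    code = ord(ch)
    return (
        0x4E00 <= code <= 0x9FFF     # CJK Unified Ideographs
        or 0x3040 <= code <= 0x30FF  # Hiragana and Katakana
        or 0x3000 <= code <= 0x303F  # CJK symbols and punctuation
        or 0xFF00 <= code <= 0xFFEF  # full-width forms
    )


def smart_join(tokens):
    """Join tokens, adding a space only between two non-CJK word characters."""
    out = ""
    for tok in tokens:
        if not tok:
            continue
        prev, nxt = (out[-1] if out else ""), tok[0]
        if prev and prev.isalnum() and nxt.isalnum() and not _is_cjk(prev) and not _is_cjk(nxt):
            out += " "
        out += tok
    return out
```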
---------
Co-authored-by: fqscfqj <fqsfqj@outlook.com>
- Strict monotonic Go coverage gate (make test-coverage-check, 45% baseline)
run in CI; fixes ginkgo dropping all-but-one coverprofile across multiple
recursive roots, builds with -tags auth, and folds in the in-process
tests/e2e suite via --coverpkg.
- React UI e2e coverage (make test-ui-coverage: vite-plugin-istanbul + nyc,
nix-provided Chromium) plus e2e specs for 6 previously-untested pages, and a
UI coverage gate (make test-ui-coverage-check) with a small tolerance since
e2e line coverage jitters ~0.5pp run-to-run.
- pre-commit hook: lint + coverage on Go changes, Playwright e2e + UI coverage
gate on react-ui changes; install with make install-hooks.
- New Go handler tests (settings, branding), hermetic base64 download test.
- fix(ui): model editor reads vram_display (snake_case), so the VRAM estimate
renders again; covered by a regression test.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
The build context shipped to the daemon included several large
untracked directories the image never needs: saved image tarballs
(backend-images), locally-installed backends (local-backends), the
host-built binary (local-ai), the rust target/ build output, and
host node_modules/protoc/tests. This bloated the context to ~23GB.
Exclude them so only the sources the Dockerfile actually copies are
transferred. backend/rust sources stay tracked; only target/ is ignored.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Walk down the release history and add per-release one-liners for 4.3.0,
4.2.0, 4.1.0, and 4.0.0 in the Latest News section, leading with the
headline win for each release. Move Prem into a collapsible "Past
sponsors" block under the active sponsors row.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.7 [claude-code]
* ⬆️ Update ggml-org/llama.cpp
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix(llama-cpp): track upstream rename checkpoint_every_nt -> checkpoint_min_step
Upstream llama.cpp renamed common_params::checkpoint_every_nt to
checkpoint_min_step and changed its default from 8192 to 256. The semantics
also shifted: it used to enforce a fixed checkpoint cadence during prefill,
now it sets a minimum spacing between context checkpoints. Track the new
field name in grpc-server.cpp and accept the old option names as backward-
compatible aliases for users with existing configs.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7
---------
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
When the C++ autoparser is in pure-content fallback mode (qwen3-4b) and the
model emits a tool-call JSON in non-thinking mode, the streaming worker
ended the SSE stream with a spurious
data: {...,"delta":{"reasoning":"{\"name\":\"exec\",\"arguments\":...}"}}
chunk carrying the same JSON that was already in delta.tool_calls.
The Go-side ReasoningExtractor is configured from
DetectThinkingStartToken, which scans the model's jinja chat template
verbatim and finds <think> inside an {% if enable_thinking %} block
without evaluating the conditional. Every output chunk then runs through
PrependThinkingTokenIfNeeded, which synthesizes a <think> in front and
makes ExtractReasoning treat everything after as reasoning. The autoparser
correctly classifies zero reasoning (qwen3's tool format isn't on
llama.cpp's recognized-tool list, so all tokens land in
ChatDelta.Content), but processStreamWithTools then preferred
extractor.Reasoning() over functions.ReasoningFromChatDeltas at the
end-of-stream flush — handing the polluted Go-side state to
buildDeferredToolCallChunks, which emitted it as a trailing reasoning
chunk.
Two changes:
* Add a sticky preferAutoparser flag to processStreamWithTools, mirroring
the analogous flag in processStream from #9985. Once any ChatDelta
carries content or reasoning, the flag stays on for the rest of the
stream and the worker stops falling back to the Go-side extractor for
per-token deltas. This avoids the per-chunk leak path and the cumulative
pollution.
* Extract chooseDeferredReasoning, a small helper that selects the
end-of-stream reasoning source. When preferAutoparser is set, return
functions.ReasoningFromChatDeltas(chatDeltas); otherwise fall back to
extractor.Reasoning() (the correct source for vLLM and other backends
with no autoparser).
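The selection rule in the second bullet can be sketched as below. This is
a simplified illustration, not the actual helper: the real function takes
the ChatDeltas and the extractor, while this version takes the two
already-computed strings as hypothetical parameters.

```go
package main

import "fmt"

// chooseDeferredReasoning is a reduced, hypothetical form of the helper:
// it picks which reasoning text the end-of-stream flush should emit.
func chooseDeferredReasoning(preferAutoparser bool, autoparserReasoning, extractorReasoning string) string {
	if preferAutoparser {
		// Trust the autoparser even when it reports no reasoning at all;
		// the Go-side state may be polluted by a synthesized <think>.
		return autoparserReasoning
	}
	// Backends without an autoparser rely on the Go-side extractor.
	return extractorReasoning
}

func main() {
	polluted := `{"name":"exec","arguments":{}}`
	fmt.Printf("%q\n", chooseDeferredReasoning(true, "", polluted))
	fmt.Printf("%q\n", chooseDeferredReasoning(false, "", "step by step"))
}
```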
The helper has a focused test suite covering both sides of the contract:
autoparser-active with empty reasoning (the qwen3 case — the fix's
purpose), autoparser-active with real reasoning_content
(jinja-with-recognized-format models), and autoparser-not-active with
genuine Go-side reasoning (vLLM-style backends).
E2E with combined #9988 and this fix on qwen3-4b post-#9985 gallery
shape: 18 content chunks of the tool-call JSON, 1 tool_call chunk with
name='exec' and the right arguments, finish_reason=tool_calls, and zero
reasoning chunks — down from one polluted reasoning chunk before this
fix.
Depends on #9999 (the streaming JSON tool-call gating bug for qwen3) to
make the trailing chunk observable end-to-end; the helper unit tests are
independent.
Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(streaming/tools): stop healing-marker stubs from gating off content
When the C++ autoparser is in pure-content fallback mode (e.g. qwen3
without --jinja) and the model emits a tool call as JSON, the streaming
worker calls ParseJSONIterative on each new chunk. parseJSONWithStack
heals partial input like `{` into `{"<marker>":1}` where <marker> is a
random integer. removeHealingMarkerFromJSON only stripped the marker
from values, so the synthetic key survived and downstream callers saw
a stub object with a random-looking key.
chat_stream_workers.go's JSON tool-call detector then bumped
lastEmittedCount past the stub even though no real tool call was
emitted, gating off ALL subsequent content chunks. The qwen3 + tools +
streaming case ended up dribbling only the first `{"` to clients and
then nothing, even when the model went on to call the noAction
`answer({"message": "…"})` pseudo-tool.
Three changes, each with its own regression test:
* removeHealingMarkerFromJSON now strips the marker suffix from keys
too, dropping the entry when the truncated key is empty. Inputs like
`{` no longer leak `{"<marker>":1}` to callers; partial keys like
`{ "code` still preserve the model-typed prefix `code`.
* ParseJSONIterative skips empty-after-healing maps so a healed `{`
doesn't surface as a stub result.
* The streaming JSON detector now breaks (not continues) on entries
without a usable `name`, and only bumps lastEmittedCount past
successfully-emitted entries. Defense-in-depth against any future
partial-parse shape.
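A minimal sketch of the key-stripping rule from the first bullet, assuming
the healed object is already decoded into a map. removeMarkerFromMap and
stripMarker are hypothetical names, and keeping the value of a truncated
key unchanged is an assumption, not something the commit states.

```go
package main

import (
	"fmt"
	"strings"
)

// stripMarker cuts s at the healing marker, reporting whether it was found.
func stripMarker(s, marker string) (string, bool) {
	i := strings.Index(s, marker)
	if i < 0 {
		return s, false
	}
	return s[:i], true
}

// removeMarkerFromMap drops entries whose key is only the marker and keeps
// the model-typed prefix of partially typed keys and string values.
func removeMarkerFromMap(m map[string]any, marker string) map[string]any {
	out := make(map[string]any, len(m))
	for k, v := range m {
		key, healed := stripMarker(k, marker)
		if healed && key == "" {
			continue // purely synthetic key: nothing the model typed
		}
		if s, ok := v.(string); ok {
			v, _ = stripMarker(s, marker)
		}
		out[key] = v
	}
	return out
}

func main() {
	const marker = "4821" // stands in for the random integer marker
	fmt.Println(removeMarkerFromMap(map[string]any{"4821": 1}, marker))
	fmt.Println(removeMarkerFromMap(map[string]any{"code4821": 1}, marker))
}
```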
The parser tests cover eight partial-JSON-prefix shapes and verify no
marker characters leak into keys, plus the two early shapes (`{`,
`{"`) that should not surface a stub at all.
Fixes #9988
Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(streaming/tools): cover the autoparser-correctly-working path
Extract the JSON tool-call streaming emit loop into emitJSONToolCallDeltas
and unit-test it against every shape that can hit the streaming worker:
* the bug case — a healing-marker stub at index 0 must NOT bump
lastEmittedCount, so subsequent content chunks keep flowing;
* the autoparser-correctly-working case — empty jsonResults (because
the C++ autoparser cleared the raw text and delivers tool calls via
TokenUsage.ChatDeltas) is a no-op, leaving the deferred end-of-stream
emitter to ship the autoparser's tool calls;
* a single complete tool call — emit one chunk, advance to 1;
* arguments arriving as a JSON-string vs as a nested object — both
serialize to the wire as JSON-string arguments;
* multiple parallel tool calls — one chunk each;
* a real tool call followed by a partial stub — emit the real one,
stop at the stub, resume on a later chunk once the stub completes.
Locks down the no-regression guarantee the user asked for: this PR's
fix is scoped to the pure-content fallback path; when the autoparser
actually classifies tool calls (jinja-recognized chat format with tool
support), the helper is a no-op and nothing changes.
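The break-on-stub behaviour covered by these cases can be illustrated
with a reduced loop. This is a sketch only: emitToolCalls is a
hypothetical stand-in for emitJSONToolCallDeltas, and it emits just the
tool name rather than a full SSE chunk.

```go
package main

import "fmt"

// emitToolCalls emits entries starting at lastEmitted and stops at the
// first one without a usable name. It returns the new emitted count, which
// only advances past entries that were actually emitted.
func emitToolCalls(results []map[string]any, lastEmitted int, emit func(name string)) int {
	for i := lastEmitted; i < len(results); i++ {
		name, _ := results[i]["name"].(string)
		if name == "" {
			break // partial stub: wait for a later chunk to complete it
		}
		emit(name)
		lastEmitted = i + 1
	}
	return lastEmitted
}

func main() {
	emit := func(name string) { fmt.Println("tool call:", name) }
	n := emitToolCalls([]map[string]any{{"name": "exec"}, {"na": 1}}, 0, emit)
	fmt.Println("emitted so far:", n)
}
```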
Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
feat(stablediffusion-ggml): mux LTX-2 audio into output MP4
sd.cpp's generate_video now returns a sd_audio_t* alongside the video
frames for models with an audio VAE (LTX-2.3). Our gosd wrapper was
already collecting that pointer but immediately freed it without ever
muxing it into the output, so LTX-2 generations landed as silent MP4s
even though the audio VAE decode succeeded.
Stage the planar float32 waveform to a temp WAV (IEEE float, header
hand-built; samples interleaved on the fly), then add it as a second
ffmpeg input with -c:a aac -map 0:v:0 -map 1:a:0 -shortest. The temp
WAV is cleaned up unconditionally after ffmpeg exits, including on
the write/waitpid error paths.
Non-LTX models (Wan i2v / FLF2V) keep their current behaviour: audio
arg is nullptr, the audio-related ffmpeg flags are not added, and no
temp file is created.
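For illustration, the WAV staging step could look roughly like the Go
sketch below (the actual wrapper is C++). encodeFloatWAV is a
hypothetical name, and the 16-byte fmt chunk without a fact chunk is an
assumption about what "hand-built" means here.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeFloatWAV turns planar float32 audio (all of channel 0, then all of
// channel 1, ...) into an interleaved IEEE-float WAV byte stream.
func encodeFloatWAV(planar []float32, channels, sampleRate int) []byte {
	if channels <= 0 {
		return nil
	}
	frames := len(planar) / channels
	dataLen := uint32(frames * channels * 4)
	le := binary.LittleEndian
	var buf bytes.Buffer
	buf.WriteString("RIFF")
	binary.Write(&buf, le, uint32(36)+dataLen) // RIFF size: header rest + data
	buf.WriteString("WAVEfmt ")
	binary.Write(&buf, le, uint32(16))                    // fmt chunk size
	binary.Write(&buf, le, uint16(3))                     // WAVE_FORMAT_IEEE_FLOAT
	binary.Write(&buf, le, uint16(channels))              // channel count
	binary.Write(&buf, le, uint32(sampleRate))            // sample rate
	binary.Write(&buf, le, uint32(sampleRate*channels*4)) // byte rate
	binary.Write(&buf, le, uint16(channels*4))            // block align
	binary.Write(&buf, le, uint16(32))                    // bits per sample
	buf.WriteString("data")
	binary.Write(&buf, le, dataLen)
	// Interleave on the fly: one sample per channel for each frame.
	for f := 0; f < frames; f++ {
		for c := 0; c < channels; c++ {
			binary.Write(&buf, le, planar[c*frames+f])
		}
	}
	return buf.Bytes()
}

func main() {
	wav := encodeFloatWAV([]float32{1, 2, 3, 4}, 2, 48000)
	fmt.Println(len(wav), "bytes")
}
```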
Assisted-by: Claude:claude-opus-4-7
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
When LocalAI templates a thinking model outside of jinja (the default for
the qwen3 gallery family), llama.cpp's chat parser falls back to a
"pure content" PEG parser that dumps the entire raw response into
ChatDelta.Content with an empty ReasoningContent. The Go side then
trusted that content verbatim and overrode tokenCallback's
correctly-split reasoning, so <think>...</think> blocks ended up in the
OpenAI `content` field. Regression from v4.0.0 introduced when the
autoparser ChatDeltas path was added (#9224).
The override now runs Go-side reasoning extraction defensively when the
autoparser delivered content but no reasoning. The streaming worker
gains a sticky preferAutoparser flag that flips on the first chunk
where the autoparser classified reasoning_content; until then we use
the streaming Go-side extractor. Realtime mirrors the non-streaming
fallback. When the autoparser already populated ReasoningContent we
trust it untouched, so jinja-enabled installs are not regressed.
gallery/qwen3.yaml now enables use_jinja, letting the autoparser
classify <think> natively for all 20+ qwen3 family entries that share
this template.
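The non-streaming fallback rule can be sketched as follows. splitThink
and resolve are hypothetical names, and the extractor here only handles a
literal <think>...</think> pair, which is far simpler than the real
Go-side extraction.

```go
package main

import (
	"fmt"
	"strings"
)

// splitThink separates the first <think>...</think> block from the rest.
func splitThink(raw string) (reasoning, content string) {
	start := strings.Index(raw, "<think>")
	if start < 0 {
		return "", raw
	}
	rest := raw[start+len("<think>"):]
	end := strings.Index(rest, "</think>")
	if end < 0 {
		// Unclosed block: treat everything after the tag as reasoning.
		return strings.TrimSpace(rest), strings.TrimSpace(raw[:start])
	}
	content = raw[:start] + rest[end+len("</think>"):]
	return strings.TrimSpace(rest[:end]), strings.TrimSpace(content)
}

// resolve trusts the autoparser when it classified reasoning, and otherwise
// runs the Go-side extraction defensively over the delivered content.
func resolve(autoContent, autoReasoning string) (reasoning, content string) {
	if autoReasoning != "" {
		return autoReasoning, autoContent
	}
	return splitThink(autoContent)
}

func main() {
	r, c := resolve("<think>plan the reply</think>Hello!", "")
	fmt.Printf("reasoning=%q content=%q\n", r, c)
}
```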
Fixes #9985
Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
LTX-2.3 i2v inference fails inside generate_video with:
[ERROR] LTXAV image conditioning requires VAE encoder weights;
create the context with vae_decode_only=false
Without vae_decode_only:false in the options block, gosd.cpp creates
the sd_ctx with VAE encoder weights freed, so latent encoding of the
init_image is impossible. Adding the option mirrors what we already
do for Wan i2v entries.
Affects all six LTX-2.3 entries (dev/distilled × UD-Q4_K_M, Q4_K_M,
Q8_0). T2V wasn't impacted by the missing option since it has no
init image to encode, which is why the T2V smoke earlier passed.
Assisted-by: Claude:claude-opus-4-7
LTX-2.3 entries (dev / distilled, UD-Q4_K_M / Q4_K_M / Q8_0) were
missing the `diffusion_model` option in their overrides. Without it,
gosd.cpp routes the main GGUF through the regular `model_path` code
path in sd.cpp, which doesn't apply the `model.diffusion_model.` tensor
prefix. sd.cpp's LTX-2.3 architecture detection (`VERSION_LTXAV`) in
get_sd_version checks for prefixed tensor names — without the prefix,
detection fails and load_model returns "could not load model".
This is the same bug we hit for Wan when the option was missing.
Adding `- diffusion_model` to all six LTX-2.3 entries' option blocks
makes load_model take the diffusion_model_path branch so detection
succeeds.
Assisted-by: Claude:claude-opus-4-7
When the LocalAI frontend deployment is scaled past one replica, the UI's
/api/operations poll round-robins between pods. Each pod kept the OpCache
(galleryID->jobID), OpStatus map, and the post-install in-memory caches
(ModelConfigLoader, UpgradeChecker) purely in-process. Reads never
consulted PostgreSQL or NATS even though writes already published to PG.
Symptoms:
- A user installing a model on replica A saw the operation card flicker
in and out as the load balancer alternated.
- The Models page re-fetched the whole gallery on every flicker because
useEffect([operations.length]) re-fires when the count changes.
- A chat completion that landed on replica B after the install completed
on replica A failed to find the new model — B's ModelConfigLoader was
still the old one because nothing told it to reload from disk.
- The UpgradeChecker 6-hour cache stayed stale on peer replicas after a
backend upgrade, so /api/backends/upgrades kept surfacing an upgrade
that had already shipped.
Mirror the jobs Dispatcher pattern for gallery ops:
- OpCache learns SetMessagingClient/SetGalleryStore + a Start(ctx) that
hydrates from PostgreSQL and subscribes to gallery.opcache.{start,end}.
Set/SetBackend now upsert cache_key + is_backend_op on the
gallery_operations row and broadcast OpCacheEvent so peers merge it in. The
hydrate path uses a new GalleryStore.ListActive() (status in {pending,
downloading, processing} and updated within 30 min).
- GalleryService.SubscribeBroadcasts wires a
SubjectGalleryProgressWildcard subscriber that calls a new lock-light
mergeStatus into the
local statuses map, plus a SubjectGalleryCancelWildcard subscriber that
runs the locally-registered cancel func. Hydrate() restores active rows
from PostgreSQL on startup so a freshly-started replica is not
observably empty mid-install. CancelOperation tolerates the cancel func
living on a different replica and publishes anyway.
- modelHandler and backendHandler publish on the new
SubjectCacheInvalidateModels / SubjectCacheInvalidateBackends after
a successful install/delete/upgrade. SubscribeBroadcasts wires peers
to refresh: OnModelsChanged (re-runs LoadModelConfigsFromPath) and
OnBackendOpCompleted (re-triggers UpgradeChecker). The originating
replica reloads inline so it never enters the broadcast handler.
- OpStatus.Error (an error interface) flat-marshalled to "{}" over JSON,
so a failed install replicated to a peer arrived with a nil error and
the UI's failure banner never appeared. Add MarshalJSON/UnmarshalJSON
via an opStatusWire shim that round-trips Error as a string.
- UpdateStatus and CancelOperation now drop the mutex before publishing
to NATS or persisting to PostgreSQL. The wildcard subscriber's
mergeStatus loops back into the same service on the publishing replica
and would deadlock otherwise; this also prevents future PG round-trips
from stalling concurrent readers on every progress tick.
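The OpStatus.Error round-trip from the list above can be sketched like
this. The struct is cut down to two fields, the JSON field names are
assumptions, and newFailed / roundTrip are hypothetical helpers added
only for the demonstration.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// OpStatus is reduced to the fields needed to show the problem.
type OpStatus struct {
	Message string
	Error   error
}

// opStatusWire carries Error as a plain string so it survives JSON.
type opStatusWire struct {
	Message string `json:"message"`
	Error   string `json:"error,omitempty"`
}

func (s OpStatus) MarshalJSON() ([]byte, error) {
	w := opStatusWire{Message: s.Message}
	if s.Error != nil {
		w.Error = s.Error.Error()
	}
	return json.Marshal(w)
}

func (s *OpStatus) UnmarshalJSON(b []byte) error {
	var w opStatusWire
	if err := json.Unmarshal(b, &w); err != nil {
		return err
	}
	s.Message, s.Error = w.Message, nil
	if w.Error != "" {
		s.Error = errors.New(w.Error)
	}
	return nil
}

// newFailed builds a failed status; roundTrip simulates sending it to a peer.
func newFailed(message, cause string) OpStatus {
	return OpStatus{Message: message, Error: errors.New(cause)}
}

func roundTrip(in OpStatus) (OpStatus, error) {
	var out OpStatus
	b, err := json.Marshal(in)
	if err != nil {
		return out, err
	}
	return out, json.Unmarshal(b, &out)
}

func main() {
	got, err := roundTrip(newFailed("install", "disk full"))
	fmt.Println(got.Message, got.Error, err)
}
```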
Tests cover the OpStatus error round-trip, OpCache propagation through a
shared in-memory bus, OpCache PostgreSQL hydration (active-only),
GalleryService progress + cancel broadcast, Nodes preservation across a
peer's bare progress tick, GalleryService hydration from PG, and the
two cache-invalidation broadcasts (models + backends). 44 specs total
in galleryop; routes/operations specs and jobs/agents suites still pass.
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
stable-diffusion.cpp gained LTX-2 video generation, which requires an
audio VAE and an embeddings_connectors safetensors in addition to the
usual diffusion model, VAE, and LLM text encoder. The pinned commit
exposes audio_vae_path and embeddings_connectors_path on
sd_ctx_params_t; wire both through the option parser so gallery entries
can point at the LTX-specific assets.
Ship six LTX-2.3 GGUF gallery entries (dev + distilled, UD-Q4_K_M /
Q4_K_M / Q8_0 each) backed by a new ltx-ggml.yaml template that
defaults to euler / cfg_scale 6.0 / vae_decode_only:false /
diffusion_flash_attn / offload_params_to_cpu — matching the upstream
LTX-2 CLI recipe. Each entry pulls the model GGUF plus the QAT
gemma-3-12b-it text encoder, video VAE, audio VAE, and embeddings
connectors needed for T2V / I2V / FLF2V.
Assisted-by: Claude:claude-opus-4-7 [Claude-Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): add per-model ModelLoadInfo persistence
Adds a dedicated ModelLoadInfo table keyed by model name, decoupled from
the per-replica NodeModel rows. The reconciler can now recover model load
metadata after every NodeModel row has been removed (worker death,
eviction, MarkOffline reaping, frontend restart with stale heartbeats),
which is the read side of Bug-1 from the distributed mode bug hunt.
Registry exposes:
- UpsertModelLoadInfo: ON CONFLICT (model_name) update; last-write-wins,
matching the existing per-replica blob semantics under concurrent
multi-frontend dispatch.
- GetModelLoadInfo: read from the new table first; fall back to the
legacy NodeModel-blob scan for rows written before any frontend in
the cluster ran an UpsertModelLoadInfo (rolling-upgrade transition).
SetNodeModelLoadInfo (per-replica blob) is preserved for backward
compatibility and per-replica diagnostics; the dispatch-path hook in the
next commit calls both.
The new table joins the existing nodes AutoMigrate set under the same
schema-migration advisory lock.
Refs: Bug-1, docs/superpowers/specs/2026-05-24-distributed-mode-bug-hunt-findings.md
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m]
* fix(distributed): persist per-model load info on dispatch
scheduleAndLoad now writes the (backendType, ModelOptions blob) pair to
the new ModelLoadInfo table in addition to the existing per-replica
NodeModel.model_opts_blob field. The per-replica blob still works for
the hot path; the per-model row outlives every NodeModel row going away,
which is what unblocks the reconciler on the read side.
Both writes are best-effort with warn-level logging on failure: a write
miss here just means the reconciler may need a fresh inference request
to repopulate, which is the pre-fix behavior.
Concurrency: two frontends loading the same model at the same time both
fire UpsertModelLoadInfo; ON CONFLICT (model_name) makes the row
converge to whichever commits last. Matches the existing per-replica
blob semantics.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m]
* test(distributed): cover load info persistence and Bug-1 recovery
Adds Ginkgo specs that prove the persistence layer behaves correctly and
that the reconciler actually recovers from the frontend-restart scenario
that was failing in production:
registry_test.go:
- per-model row survives RemoveAllNodeModelReplicas (the bug repro)
- ON CONFLICT (model_name) updates backend type + blob, last-write-wins
- legacy NodeModel-blob fallback still works (rolling-upgrade transition)
- GetModelLoadInfo returns ErrRecordNotFound when both sources are empty
- UpsertModelLoadInfo rejects empty model names
reconciler_test.go:
- Bug-1 end-to-end: with min_replicas=2, no NodeModel rows, but a
ModelLoadInfo row present, one reconcile tick fires two scheduler
calls. Pre-fix this returned "no load info" and the scheduler never
got called until a fresh inference request arrived.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m]
* docs(distributed): note restart-safe reconciler behavior
Adds a bullet to the Replica Reconciler section explaining that per-model
load metadata is persisted across frontend restarts via the new
model_load_infos PostgreSQL table, so a rolling upgrade no longer needs a
fresh inference request per model before the reconciler can replace dead
replicas.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): add per-request node ID context holder
Introduce pkg/distributedhdr, a leaf package carrying a per-request
*atomic.Value holder for the picked worker node ID from the
SmartRouter (core/services/nodes) up to the HTTP response writer
wrapper (core/http/middleware). Avoids the import cycle that a shared
key in either consumer would create.
Exposes NewHolder, WithHolder, Holder, Stamp, Load, Inherit. The
holder is atomic.Value so cross-goroutine publish from the router to
the response writer wrapper is race-clean.
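A rough sketch of what such a holder package might look like. The
function names come from the list above, but every signature here is an
assumption, and newRequestContext is a hypothetical helper standing in
for the middleware.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
)

type holderKey struct{}

func NewHolder() *atomic.Value { return &atomic.Value{} }

// WithHolder attaches the holder to the context.
func WithHolder(ctx context.Context, h *atomic.Value) context.Context {
	return context.WithValue(ctx, holderKey{}, h)
}

// Holder returns the attached holder, or nil when there is none.
func Holder(ctx context.Context) *atomic.Value {
	h, _ := ctx.Value(holderKey{}).(*atomic.Value)
	return h
}

// Stamp publishes the picked node ID; it is a no-op without a holder.
func Stamp(ctx context.Context, nodeID string) {
	if h := Holder(ctx); h != nil {
		h.Store(nodeID)
	}
}

// Load reads the stamped node ID, or "" when nothing was stamped.
func Load(ctx context.Context) string {
	if h := Holder(ctx); h != nil {
		if id, ok := h.Load().(string); ok {
			return id
		}
	}
	return ""
}

// Inherit carries the parent's holder over to a derived context.
func Inherit(child, parent context.Context) context.Context {
	if h := Holder(parent); h != nil {
		return WithHolder(child, h)
	}
	return child
}

func newRequestContext() context.Context {
	return WithHolder(context.Background(), NewHolder())
}

func main() {
	ctx := newRequestContext()
	Stamp(ctx, "node-a")
	fmt.Println(Load(ctx))
}
```

Because the context only stores a pointer to the holder, a Stamp made
deep in the router is visible to whoever still holds the original request
context.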
Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): add ExposeNodeHeader middleware + response writer wrapper
New ApplicationConfig.ExposeNodeHeader bool + --expose-node-header CLI
flag / LOCALAI_EXPOSE_NODE_HEADER env var (default off; the node ID
reveals internal topology and is opt-in).
The middleware creates a per-request *atomic.Value holder, attaches it
to c.Request().Context() via distributedhdr.WithHolder, and wraps
c.Response().Writer with a custom http.ResponseWriter that sets the
X-LocalAI-Node header on first Write / WriteHeader / Flush by reading
the holder. Implements http.Flusher, http.Hijacker, Unwrap so it
composes cleanly with Echo and http.NewResponseController.
request.go propagates the holder onto derived contexts via
distributedhdr.Inherit so the holder survives the correlation-ID
context replacement.
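The lazy header stamping can be illustrated with a small wrapper. This is
a sketch under assumptions: nodeHeaderWriter and serveOnce are
hypothetical names, http.Hijacker is omitted for brevity, and
httptest.ResponseRecorder stands in for the real connection.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// nodeHeaderWriter sets X-LocalAI-Node once, just before the first byte.
type nodeHeaderWriter struct {
	http.ResponseWriter
	holder *atomic.Value
	done   bool
}

func (w *nodeHeaderWriter) stamp() {
	if w.done {
		return
	}
	w.done = true
	if id, ok := w.holder.Load().(string); ok && id != "" {
		w.Header().Set("X-LocalAI-Node", id)
	}
}

func (w *nodeHeaderWriter) WriteHeader(code int) {
	w.stamp()
	w.ResponseWriter.WriteHeader(code)
}

func (w *nodeHeaderWriter) Write(b []byte) (int, error) {
	w.stamp()
	return w.ResponseWriter.Write(b)
}

func (w *nodeHeaderWriter) Flush() {
	w.stamp()
	if f, ok := w.ResponseWriter.(http.Flusher); ok {
		f.Flush()
	}
}

// Unwrap lets http.NewResponseController reach the underlying writer.
func (w *nodeHeaderWriter) Unwrap() http.ResponseWriter { return w.ResponseWriter }

// serveOnce simulates one request: the router stamps, the handler writes.
func serveOnce(nodeID string) string {
	rec := httptest.NewRecorder()
	holder := &atomic.Value{}
	w := &nodeHeaderWriter{ResponseWriter: rec, holder: holder}
	if nodeID != "" {
		holder.Store(nodeID) // what the router would do after picking
	}
	w.Write([]byte("ok"))
	return rec.Header().Get("X-LocalAI-Node")
}

func main() {
	fmt.Println(serveOnce("node-a"))
}
```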
Unit + race-clean concurrency + integration specs.
Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): stamp node ID in router and wire middleware to inference routes
ModelRouterAdapter.Route stamps the picked node ID into the
per-request holder via distributedhdr.Stamp(ctx, result.Node.ID) right
after replica selection.
Wire ExposeNodeHeader middleware to:
- OpenAI chat/completion/embeddings + audio transcriptions/speech + image generations/inpainting
- Anthropic /v1/messages
- Ollama /api/chat, /api/generate, /api/embed, /api/embeddings
- Jina /v1/rerank
- LocalAI /v1/vad
The middleware's wrapper reads the holder on first byte and sets the
X-LocalAI-Node response header before delegating to the underlying
writer. Per-request scope means no race under concurrent multi-replica
routing.
Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(distributed): thread request context through backend Load + cover ctx propagation
Five non-OpenAI backend helpers were silently using app.Context instead
of the request context for the gRPC backend call: transcription, TTS,
image generation, rerank, VAD. Effect: distributedhdr.Stamp in the
router callback was a silent no-op for these paths, AND client
cancellation didn't propagate to in-flight inference.
Thread c.Request().Context() (or the equivalent input.Context after
the request middleware has installed the correlation-ID derived
context) through each helper and into ModelOptions via
model.WithContext(ctx). ImageGeneration's signature gains a leading
ctx parameter; in-tree callers (openai image, openai inpainting,
openai inpainting_test) are updated to match.
ModelEmbedding gains a leading ctx parameter for the same reason; the
openai and ollama embedding handlers pass the request context through.
chat_stream_workers.go defers the initial role=assistant chunk
emission until the first token callback so the wrapper's lazy
X-LocalAI-Node lookup against the loader runs AFTER ml.Load has
stamped the per-modelID node ID; semantically identical for clients
(role still arrives before any text).
Regression test core/backend/ctx_propagation_test.go pins ctx
propagation for all five helpers.
Docs updated to enumerate the full endpoint coverage of the
--expose-node-header flag.
Assisted-by: Claude:claude-opus-4-7[1m]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(gpu-detect): clinfo --json fallback for Intel discrete VRAM
ghw returns 0 VRAM for any i915-driven Intel GPU because the kernel
driver doesn't expose VRAM through the sysfs paths ghw checks (no
mem_info_vram_total — that's an amdgpu interface). xpu-smi, the
canonical Intel tool, isn't in the oneAPI base image (it lives in a
separate xpumanager package). The capability gate added in 19c92c70
("default to CPU if there is less than 4GB of GPU available") then
demotes the host to CPU even on a 16 GB Arc A770.
clinfo ships with the OpenCL ICD loader and is present in the oneAPI
base image, so plug it in as the last-resort Intel VRAM source:
xpu-smi -> intel_gpu_top -> clinfo --json
The parser drops UMA devices via HOST_UNIFIED_MEMORY=true so an iGPU
sibling can't double-count system RAM, and dedups by PCI BDF when
multiple ICDs enumerate the same physical device (POCL caps reported
GLOBAL_MEM_SIZE at 4 GiB; the largest non-capped value wins).
Subprocess is wrapped in a 2s timeout and memoised with sync.OnceValue
— GPU hardware is static for the process lifetime. The Intel branch
also short-circuits when ghw saw no Intel vendor, so NVIDIA-only hosts
don't pay the spawn cost.
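The filtering and dedup rule can be sketched independently of the clinfo
JSON shape. clDevice and discreteVRAMByBDF are hypothetical names, the
struct is assumed to be filled in by the parser, and "largest value wins"
is implemented here as a plain maximum per BDF.

```go
package main

import "fmt"

// clDevice holds the few OpenCL properties the rule needs.
type clDevice struct {
	BDF           string // PCI bus:device.function
	GlobalMemSize uint64 // bytes
	UnifiedMemory bool   // HOST_UNIFIED_MEMORY
}

// discreteVRAMByBDF drops UMA devices and keeps the largest reported size
// for each physical device, so a capped ICD cannot shrink the result.
func discreteVRAMByBDF(devs []clDevice) map[string]uint64 {
	out := map[string]uint64{}
	for _, d := range devs {
		if d.UnifiedMemory {
			continue // iGPU sharing system RAM
		}
		if d.GlobalMemSize > out[d.BDF] {
			out[d.BDF] = d.GlobalMemSize
		}
	}
	return out
}

func main() {
	devs := []clDevice{
		{BDF: "03:00.0", GlobalMemSize: 4 << 30},     // capped ICD
		{BDF: "03:00.0", GlobalMemSize: 16225243136}, // native ICD
		{BDF: "00:02.0", GlobalMemSize: 32 << 30, UnifiedMemory: true},
	}
	fmt.Println(discreteVRAMByBDF(devs))
}
```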
Verified end-to-end on Intel Arc A770: ghw -> 0, clinfo path reports
16,225,243,136 bytes (15.11 GiB), capability gate now passes naturally
without LOCALAI_FORCE_META_BACKEND_CAPABILITY=intel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(gpu-detect): live VRAM usage from DRM fdinfo
The clinfo fallback reports total VRAM correctly but leaves UsedVRAM
at 0 because OpenCL has no portable live-memory property — the UI
ends up showing 0% utilisation even when llama-cpp is actually
holding gigabytes in device memory.
Fill that gap with the standardised Linux DRM fdinfo interface
(Documentation/gpu/drm-usage-stats.rst, kernel ≥5.19). Walking
/proc/<pid>/fdinfo for any fd that points at /dev/dri/render* yields
drm-total-<region> / drm-resident-<region> keys; aggregate per
render-node, resolve the render node to a PCI BDF via
/sys/class/drm/<name>/device, and merge the result into the matching
GPUMemoryInfo by BDF.
Region naming is driver-defined — i915 uses "local0" for device-local
VRAM, amdgpu and xe use "vram0" — so a prefix-match on local/vram
covers all three DRM drivers that LocalAI cares about. system/gtt/
stolen regions are deliberately excluded since they're host RAM
mirrors and would double-count against system RAM.
GPUMemoryInfo gains an optional BDF field (`bdf,omitempty` in JSON)
so future vendor-specific detectors can plug into the same matcher.
Empty BDF skips the merge — non-PCI devices and detection paths that
don't surface PCI location keep their existing behaviour.
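A minimal sketch of the region matching, working on the text of a single
fdinfo file. residentVRAMBytes is a hypothetical name, and summing the
drm-resident keys (rather than drm-total) as "used" memory is an
assumption made for this example.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// residentVRAMBytes sums drm-resident-<region> values for regions whose
// name starts with "local" or "vram", skipping host-RAM regions.
func residentVRAMBytes(fdinfo string) uint64 {
	var total uint64
	sc := bufio.NewScanner(strings.NewReader(fdinfo))
	for sc.Scan() {
		key, val, ok := strings.Cut(sc.Text(), ":")
		if !ok {
			continue
		}
		region, ok := strings.CutPrefix(strings.TrimSpace(key), "drm-resident-")
		if !ok || !(strings.HasPrefix(region, "local") || strings.HasPrefix(region, "vram")) {
			continue
		}
		fields := strings.Fields(val)
		if len(fields) == 0 {
			continue
		}
		n, err := strconv.ParseUint(fields[0], 10, 64)
		if err != nil {
			continue
		}
		if len(fields) > 1 { // optional unit suffix
			switch fields[1] {
			case "KiB":
				n <<= 10
			case "MiB":
				n <<= 20
			}
		}
		total += n
	}
	return total
}

func main() {
	sample := "drm-driver:\ti915\n" +
		"drm-resident-local0:\t2048 KiB\n" +
		"drm-resident-system0:\t512 KiB\n"
	fmt.Println(residentVRAMBytes(sample))
}
```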
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a routing middleware stack and a cloud-proxy backend.
* cloud-proxy: a Go gRPC backend that forwards OpenAI- and
Anthropic-shaped chat requests to upstream providers, with an
optional translate mode (OpenAI request -> Anthropic /v1/messages
-> OpenAI response) and full tool-calling support.
* routing: admission control, content-aware model routing
(embedding cache + classifier + rerank + Arch-Router score),
PII detection/redaction (regex + NER) with streaming filter and
OpenAI/Anthropic adapters, and a per-user/per-key billing recorder
backed by GORM or in-memory storage.
* middleware: UsageMiddleware records usage via the billing recorder,
plus admission, route-model, usage-stamp and trace middlewares.
* observability: BackendTrace ring buffer stores full request bodies
(capped), MITM proxy emits structured trace events, and router
classifier decisions surface at /api/router/decide.
* gallery: Arch-Router-1.5B (Q4_K_M and Q8_0).
* UI: cloud-proxy model-editor fields, classifier system-prompt and
score-normalization config, and a Traces page rendering request
bodies.
Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
cosign v2.4.1 still gates --registry-referrers-mode=oci-1-1 behind the
experimental flag, so the first signing run after the backend-signing
merge failed with "you must set COSIGN_EXPERIMENTAL=1". Set it at the
job env level so both the quay and dockerhub cosign steps inherit it,
and note the requirement in .agents/backend-signing.md so a future
cosign bump can drop the flag.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
* refactor(distributed): extract PickBestReplica from FindAndLockNodeWithModel
Lifts the replica-selection policy (in_flight ASC, last_used ASC,
available_vram DESC) out of the SQL ORDER BY into a pure Go function in
the new replicapicker.go. The SQL clause keeps its FOR UPDATE atomicity
and remains the production path used by SmartRouter; PickBestReplica is
the canonical implementation that the future per-frontend rotating
replica cache (TODO referenced from pkg/model) will call against an
in-memory snapshot without paying a DB round-trip per inference.
A new registry_test mirror spec seeds a multi-tier scenario and asserts
both layers pick the same replica, so any future tweak to either side
fails the test until the other side is updated.
No behavior change.
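The ordering policy can be expressed in a few lines of Go. This is only a
sketch: the Replica struct and the (Replica, bool) return shape are
assumptions, and only the three sort keys come from the text above.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Replica is a hypothetical in-memory view of one loaded replica.
type Replica struct {
	NodeID        string
	InFlight      int
	LastUsed      time.Time
	AvailableVRAM uint64
}

// PickBestReplica orders by in_flight ASC, last_used ASC, available_vram
// DESC and returns the first entry, leaving the input slice untouched.
func PickBestReplica(rs []Replica) (Replica, bool) {
	if len(rs) == 0 {
		return Replica{}, false
	}
	sorted := append([]Replica(nil), rs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if a.InFlight != b.InFlight {
			return a.InFlight < b.InFlight
		}
		if !a.LastUsed.Equal(b.LastUsed) {
			return a.LastUsed.Before(b.LastUsed)
		}
		return a.AvailableVRAM > b.AvailableVRAM
	})
	return sorted[0], true
}

func main() {
	now := time.Now()
	best, _ := PickBestReplica([]Replica{
		{NodeID: "a", InFlight: 2, LastUsed: now},
		{NodeID: "b", InFlight: 1, LastUsed: now},
		{NodeID: "c", InFlight: 1, LastUsed: now.Add(-time.Minute)},
	})
	fmt.Println(best.NodeID)
}
```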
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
* fix(distributed): route per inference request and cache probeHealth
Two related fixes that together restore load balancing across loaded
replicas of the same model.
1. ModelLoader.Load and LoadModel bypass the local *Model cache when
modelRouter is set. The cached *Model wraps an InFlightTrackingClient
bound to a single (nodeID, replicaIndex) — reusing it pinned every
subsequent request to whichever node won the very first pick, so
FindAndLockNodeWithModel's round-robin never got a chance to run
even after the reconciler scaled the model out to a second node. In
distributed mode SmartRouter.Route now runs per request, and
PickBestReplica picks the least-loaded replica each time.
SmartRouter has its own coalescing (advisory DB lock for first-time
loads + singleflight on backend.install RPC) so concurrent first
requests for a not-yet-loaded model still produce a single
worker-side install.
2. SmartRouter.probeHealth memoizes successful gRPC HealthCheck results
in a new probeCache (probe_cache.go) with a 30s TTL. With per-request
routing every inference call hits probeHealth, and llama.cpp-style
backends serialize HealthCheck behind active Predict — so a burst of
incoming requests stalled on the probe to a node already mid-stream,
tripping the 2s timeout and falling through to the install path.
singleflight collapses N concurrent first-time probes for the same
(node, addr) into one round-trip, failed probes invalidate the entry
so the staleness-recovery path still triggers, and the TTL matches
pkg/model/model.go's healthCheckTTL so the single-process and
distributed paths share a staleness budget. The background
HealthMonitor still reaps actually-dead backends within ~45s.
The bypass introduces one short FindAndLockNodeWithModel transaction per
inference. A TODO in pkg/model/loader.go documents the future per modelID
rotating-replica cache that would reuse PickBestReplica against an
in-memory snapshot and skip the DB round-trip for hot paths.
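The success-only memoization described in item 2 can be sketched with
the standard library alone. The type and method names are hypothetical,
and the singleflight collapsing of concurrent probes is deliberately left
out to keep the example dependency-free.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// probeCache remembers when a key last probed healthy.
type probeCache struct {
	mu   sync.Mutex
	ttl  time.Duration
	now  func() time.Time
	okAt map[string]time.Time
}

func newProbeCache(ttl time.Duration) *probeCache {
	return &probeCache{ttl: ttl, now: time.Now, okAt: map[string]time.Time{}}
}

// healthy returns a fresh cached success, or runs probe and records the
// outcome. Failures are never cached and evict any existing entry.
func (c *probeCache) healthy(key string, probe func() bool) bool {
	c.mu.Lock()
	at, hit := c.okAt[key]
	fresh := hit && c.now().Sub(at) < c.ttl
	c.mu.Unlock()
	if fresh {
		return true
	}
	ok := probe() // run outside the lock so slow probes do not block readers
	c.mu.Lock()
	defer c.mu.Unlock()
	if ok {
		c.okAt[key] = c.now()
	} else {
		delete(c.okAt, key)
	}
	return ok
}

func main() {
	c := newProbeCache(30 * time.Second)
	calls := 0
	probe := func() bool { calls++; return true }
	c.healthy("node-a|10.0.0.1:50051", probe)
	c.healthy("node-a|10.0.0.1:50051", probe)
	fmt.Println("probe round-trips:", calls)
}
```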
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(traces): cap backend trace Data field so the admin UI stays responsive
The previous fix (#9946) capped API trace bodies but missed backend traces,
which carry the same blast radius:
- LLM backend traces store the full chat messages JSON, full response, and
full streaming deltas. Every agent-pool reasoning step ships the full
RAG-augmented history (50-500 KiB per trace, often 100+ traces queued).
- TTS / audio_transform / transcript traces embed a 30s audio snippet as
base64, around 1.3 MiB per trace.
Both blow the /api/backend-traces JSON past tens of MiB. The admin Traces
page then keeps re-downloading and re-parsing the buffer faster than the
5s auto-refresh and stays in the loading state forever, the same symptom
the API-side fix addressed.
Apply two complementary caps, both honoring LOCALAI_TRACING_MAX_BODY_BYTES:
Option A (safety net in core/trace): RecordBackendTrace walks the Data map
recursively and replaces any string value larger than the cap with
"<truncated: N bytes>". Catches anything a future producer forgets.
Option B (head-preserving at the producer):
- core/backend/llm.go: TruncateToBytes on messages, response, and
chat_deltas content/reasoning_content so the leading content stays
readable in the UI.
- core/trace/audio_snippet.go: omit audio_wav_base64 when the encoded
blob would exceed the cap (truncated base64 is undecodable). The
quality metrics still ship and the UI's WaveformPlayer simply skips
when the field is absent.
TruncateToBytes is bounded to <= maxBytes so Option A leaves the producer's
head-preserving output alone instead of replacing it with the bare marker.
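A minimal sketch of how the two caps could fit together. The function
names truncateLargeStrings and truncateHead are hypothetical stand-ins, not
the actual RecordBackendTrace or TruncateToBytes code; only the marker
format is taken from the description above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateLargeStrings is a hypothetical stand-in for the Option A safety
// net: it walks a JSON-style value in place and replaces every string
// longer than maxBytes with a "<truncated: N bytes>" marker.
func truncateLargeStrings(v any, maxBytes int) any {
	switch t := v.(type) {
	case string:
		if len(t) > maxBytes {
			return fmt.Sprintf("<truncated: %d bytes>", len(t))
		}
		return t
	case map[string]any:
		for k, val := range t {
			t[k] = truncateLargeStrings(val, maxBytes)
		}
		return t
	case []any:
		for i, val := range t {
			t[i] = truncateLargeStrings(val, maxBytes)
		}
		return t
	default:
		// Numbers, bools and nil carry no unbounded payload.
		return v
	}
}

// truncateHead is a hypothetical head-preserving cap (Option B). The result
// never exceeds maxBytes (assumed >= 0), so the safety net above leaves it
// alone. The cut backs up to a rune boundary to keep the output valid UTF-8.
func truncateHead(s string, maxBytes int) string {
	if len(s) <= maxBytes {
		return s
	}
	cut := maxBytes
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}
```

Because truncateHead returns at most maxBytes, running truncateLargeStrings
afterwards with the same cap does not replace the producer's output.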
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7
* fix(react-ui): expose tracing_max_body_bytes in Settings and Traces panels
The setting was already plumbed through env (LOCALAI_TRACING_MAX_BODY_BYTES),
CLI flag, and the runtime_settings.json GET/PUT schema, but neither the main
Settings page nor the inline Traces panel offered an input for it. Admins
hitting the "Traces UI stuck loading" symptom had to know to set an env var
or PUT raw JSON to /api/settings to dial the cap.
Add a "Max Body Bytes" row next to "Max Items" in both places. Same input
type, same disabled-when-tracing-off semantics, placeholder shows the 65536
default so users see what they're inheriting.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7
* test(react-ui): disambiguate Max Items locator after adding Max Body Bytes
The Tracing settings panel now has two number inputs. The previous spec
matched 'input[type="number"]' which became ambiguous and triggered a
Playwright strict-mode violation in CI. Switch to getByPlaceholder('100')
for Max Items and add a parallel spec for the new Max Body Bytes field
using getByPlaceholder('65536').
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): add configurable NATS backend install/upgrade timeouts
Adds BackendInstallTimeout and BackendUpgradeTimeout to DistributedConfig
with 15m defaults, following the existing MCPToolTimeout / WorkerWaitTimeout
pattern. These will replace the hardcoded literals in RemoteUnloaderAdapter
so admin-driven backend installs across the cluster survive long OCI image
pulls that previously timed out at 3m.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* style(distributed): gofmt alignment after timeout fields
Re-aligns the Validate() negative-duration map and the Default* const
block so the new BackendInstall/UpgradeTimeout entries do not leave
the surrounding columns mis-padded.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(cli): surface LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT and _UPGRADE_TIMEOUT
Parses the two new env vars on the run CLI and threads them through the
existing AppOption builder so DistributedConfig picks them up. Invalid
duration strings now fail loudly at startup rather than silently falling
back to the default.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): inject NATS install/upgrade timeouts into RemoteUnloaderAdapter
Removes the hardcoded 3m / 15m literals from RemoteUnloaderAdapter and
threads in DistributedConfig.BackendInstallTimeoutOrDefault() and
BackendUpgradeTimeoutOrDefault() at construction. Install now defaults
to 15m (was 3m); cold OCI image pulls on Jetson Wi-Fi routinely blew
past the old ceiling. Scripted messaging client captures the timeout
so tests can assert the configured value actually reaches the NATS
request.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): introduce galleryop.ErrWorkerStillInstalling sentinel
When the NATS request-reply for backend.install (or .upgrade) times out
the worker is almost always still pulling the OCI image. Wrap the timeout
in a typed sentinel so the manager above can distinguish "worker hung"
from "worker still working" and leave the pending_backend_ops row in
place for the reconciler to confirm via backend.list.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): treat NATS install timeout as in-progress, not failure
When a worker times out replying to backend.install but the install is
still running on the worker, enqueueAndDrainBackendOp now reports a
running_on_worker status and pushes NextRetryAt out by the install
timeout so the reconciler does not immediately re-fire another install
while the worker is still pulling the image. The pending_backend_ops
row stays in place for the next reconciler pass to confirm via
backend.list.
InstallBackend wraps the result in galleryop.ErrWorkerStillInstalling
so callers can branch (galleryop renders yellow in-progress instead of
red error). UpgradeBackend uses the same wrap.
Adds RemoteUnloaderAdapter.InstallTimeout() so the manager can push
NextRetryAt by the configured timeout without reaching into a private
field, and NodeRegistry.RecordPendingBackendOpInFlight as the soft
cousin of RecordPendingBackendOpFailure.
Also includes incidental gofmt-driven struct-field alignment in
registry.go on lines unrelated to the change (touched files are
re-formatted to canonical form per project policy).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(distributed): don't increment Attempts on in-flight install timeout
An in-flight timeout (worker still pulling the OCI image) is not a
failed attempt, it's a delayed one. Incrementing Attempts let
genuinely-progressing slow installs (e.g. 30 GB CUDA images on Wi-Fi)
trip the reconciler's maxPendingBackendOpAttempts cap and dead-letter
the queue row while the worker was still legitimately working.
RecordPendingBackendOpInFlight now only updates LastError and NextRetryAt.
Also documents "running_on_worker" in the NodeOpStatus.Status enum
comment so Task 6 implementers see the full surface.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(galleryop): surface ErrWorkerStillInstalling as non-error OpStatus
When the distributed backend manager returns an error that wraps
ErrWorkerStillInstalling, backendHandler now completes the op with a
"still installing in background" message rather than marking it as a
red failure. Admin UI sees a yellow in-progress state; reconciler
confirms completion on its next pass.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(distributed): end-to-end install-timeout-then-reconcile
Wires Tasks 1-6 end-to-end so any seam mismatch surfaces in CI rather
than during a real cluster install. NATS times out, the queue row
stays alive with running_on_worker status, the worker eventually
reports the backend installed via backend.list, the manager surfaces
it via ListBackends.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(distributed): document LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT / _UPGRADE_TIMEOUT
Add the two new operator-tunable env vars to the Frontend Configuration
table in the distributed-mode docs. Explains the 15m default, when to
raise it (slow links pulling multi-GB OCI images), and the new
"still installing in background" admin-UI state when the round-trip
times out but the worker is still working.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): clear pending install rows when backend.list confirms
DistributedBackendManager.ListBackends now proactively clears
pending_backend_ops install rows whose (nodeID, backend) is reported
installed by backend.list. Operator UI updates immediately instead of
waiting up to installTimeout (default 15m) for the next reconciler
tick after NextRetryAt.
Only install rows are cleared; upgrade and delete intents are not
satisfied by presence in backend.list and continue to drain through
their normal reconciler paths.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(messaging): add BackendInstallProgressEvent wire type and subject
New NATS subject nodes.<nodeID>.backend.install.<opID>.progress lets the
worker publish transient progress events (file, current/total bytes,
percentage, phase) while a long-running install pulls its OCI image.
BackendInstallRequest gains an optional OpID field so the worker knows
which subject to publish on.
Transient pub/sub (not JetStream): the install reply remains ground
truth for success/failure; dropped progress events are tolerable.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* style(messaging): drop em-dash from BackendInstallProgress test comment
Per project convention (no em-dashes anywhere). Comment substance is
unchanged.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): worker publishes debounced install progress over NATS
When BackendInstallRequest.OpID is set, the worker's backend.install
handler wires a debounced publisher (250ms window) into the gallery
download callback. Each tick becomes a BackendInstallProgressEvent on
nodes.<nodeID>.backend.install.<opID>.progress; the publisher always
emits a final event on Flush so the UI sees the terminal percentage.
Old masters that do not set OpID continue to run silent installs: no
behavior change for them. Lock ordering: the publisher releases its
mutex before calling messaging.Publish so a slow network never stalls
the install loop.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): RemoteUnloaderAdapter subscribes to install progress
InstallBackend gains opID + onProgress parameters. When both are set,
the adapter subscribes to nodes.<nodeID>.backend.install.<opID>.progress
BEFORE publishing the install request, decodes each message into the
caller's onProgress callback in a goroutine (so a slow callback never
stalls the NATS reader thread), and unsubscribes after RequestJSON
returns.
When onProgress is nil OR opID is empty (the reconciler retry path),
subscription is skipped entirely - silent installs cost nothing extra.
Subscribe failure is logged at Warn and the install proceeds without
progress streaming; the NATS round-trip still owns terminal status.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): forward backend install progress into galleryop OpStatus
DistributedBackendManager.InstallBackend now passes the gallery op ID
and a progress bridge into the adapter call. Each
BackendInstallProgressEvent from the worker becomes a
galleryop.ProgressCallback tick - which the existing backendHandler
already turns into OpStatus.UpdateStatus, so the admin UI/SSE polling
sees per-byte progress for distributed installs without any UI-side
change.
UpgradeBackend is intentionally left silent for now: its wire request
(BackendUpgradeRequest) does not carry OpID, and rolling-update
fallback is the rarer path. Will be picked up in a follow-up if the
worker upgrade path also gets a progress channel.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* test(distributed): InstallBackend tolerates silent (pre-Phase-2) workers
A worker on pre-Phase-2 code never publishes progress events. The new
master subscribes optimistically; this spec pins that a silent worker
still produces a green install with no progressCb ticks. The install
reply is the source of truth for terminal state; the progress stream
is a best-effort UX enrichment.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(distributed): document install progress streaming
Note the new nodes.<nodeID>.backend.install.<opID>.progress subject and
the silent-worker compatibility behavior so operators know to expect
real-time progress and what happens on a mixed-version cluster.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(distributed): note progress-event ordering trade-off in InstallBackend
Document near the goroutine dispatch why ordering at the consumer is
best-effort, why it rarely matters in practice (worker debounce >>
goroutine jitter), and what a future hardening pass would look like
(Seq field + stale-by-seq drop). Stops the next reader from accidentally
"fixing" the goroutine pool away.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(galleryop): add NodeProgress + OpStatus.Nodes for per-node breakdown
Adds the data model the UI needs to render an expandable per-node
breakdown of a fanned-out backend install. NodeProgress carries node
identity (ID + name), per-node status (queued / running_on_worker /
success / error / downloading), the current file + bytes + percentage
from the Phase 2 progress stream, and any per-node error.
OpStatus.Nodes is the slice the /api/operations handler will surface
in a follow-up.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(galleryop): UpdateNodeProgress merges per-node ticks by NodeID
GalleryService.UpdateNodeProgress(opID, nodeID, np) merges a NodeProgress
into OpStatus.Nodes (keyed by NodeID, no duplicates) and mirrors the
latest tick into the aggregate Progress / FileName /
DownloadedFileSize / TotalFileSize fields so the legacy single-bar
OperationsBar view keeps working unchanged alongside the new per-node
breakdown.
Concurrent-safe via the existing g.Mutex.
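As an illustration, the merge-by-NodeID behavior might be sketched like
this. The types are cut down to a few fields and the service struct is a
hypothetical stand-in for GalleryService, so treat it as a sketch rather
than the actual implementation.

```go
package main

import "sync"

// NodeProgress and OpStatus are reduced to the fields this sketch needs.
type NodeProgress struct {
	NodeID   string
	Status   string
	Progress float64
}

type OpStatus struct {
	Progress float64 // aggregate value used by the single-bar view
	Nodes    []NodeProgress
}

// service is a hypothetical stand-in for GalleryService.
type service struct {
	mu       sync.Mutex
	statuses map[string]*OpStatus
}

func newService() *service {
	return &service{statuses: map[string]*OpStatus{}}
}

// UpdateNodeProgress replaces the entry for nodeID if one exists, appends
// it otherwise, and mirrors the latest tick into the aggregate field.
func (s *service) UpdateNodeProgress(opID, nodeID string, np NodeProgress) {
	s.mu.Lock()
	defer s.mu.Unlock()

	st, ok := s.statuses[opID]
	if !ok {
		st = &OpStatus{}
		s.statuses[opID] = st
	}
	np.NodeID = nodeID
	merged := false
	for i := range st.Nodes {
		if st.Nodes[i].NodeID == nodeID {
			st.Nodes[i] = np
			merged = true
			break
		}
	}
	if !merged {
		st.Nodes = append(st.Nodes, np)
	}
	st.Progress = np.Progress
}
```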
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(distributed): write per-node OpStatus entries during install fan-out
DistributedBackendManager now accepts a nodeProgressSink and feeds it
two streams:
1. enqueueAndDrainBackendOp emits a per-node terminal entry on each
status it appends to BackendOpResult (queued, success, error,
running_on_worker). The opID is threaded through the function so
the sink gets the right gallery op identity.
2. The install apply closure fans each BackendInstallProgressEvent
into the sink as a downloading entry, alongside the legacy
progressCb path so the aggregate single-bar view stays correct.
Production wiring passes the GalleryService (which implements
UpdateNodeProgress via Task 2) as the sink. Single-node tests pass
nil. DeleteBackend and UpgradeBackend pass an empty opID so the
sink path no-ops for ops that aren't gallery-tracked the same way
as Install.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(operations): expose per-node breakdown on /api/operations
When an operation's OpStatus has Nodes entries (populated by the
Phase 4 progress sink wiring), surface them as a "nodes" array on the
/api/operations response, sorted by node_name for stable rendering.
Backward compatible: legacy clients ignore the field; ops without any
node entries (single-node mode, model installs) omit the array entirely
thanks to the empty-slice guard.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): per-node breakdown in OperationsBar
When an install op fans out to more than one worker, the operations
bar now shows a "N nodes" chevron that expands into a per-node list.
Each row carries the node's status (color-coded pill), the current
file being downloaded, byte counts, percentage, and a thin per-node
progress bar. Yellow "Worker busy" pill marks running_on_worker
status with a tooltip explaining the NATS round-trip timed out but
the worker is still installing in the background.
Backward compatible: ops without a nodes field (legacy or single-node
mode) render as before. State for expand/collapse is local to the
component, keyed by jobID/id - reload starts collapsed.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(distributed): document per-node breakdown in the operations bar
Adds a short subsection covering the expandable "N nodes" chevron in
the OperationsBar admin UI, the meaning of each status pill, and
how it relates to the /api/operations nodes array.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(galleryop): UpdateStatus preserves Nodes when caller sends none
Real-world bug surfaced by the Phase 4 multi-worker smoke test: the
nodes[] array in /api/operations flickered, showing only one node at a
time on a 2-worker install. Root cause: the Phase 2 progress bridge
also calls the legacy progressCb -> UpdateStatus(&OpStatus{...}) on
every tick. UpdateStatus then overwrote the entire status pointer,
wiping the Nodes slice that UpdateNodeProgress had just merged in.
Fix: in UpdateStatus, if the incoming op has an empty Nodes slice,
carry forward the previous status's Nodes before storing. Callers
that explicitly populate Nodes still win (their slice replaces the
prior one, no merge across the two code paths).
Two regression specs added pinning both directions of the contract.
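The carry-forward rule can be sketched as a pure function. mergeStatus is a
hypothetical helper written for illustration; in the real code the logic
lives inside UpdateStatus and the types carry more fields.

```go
package main

// NodeProgress and OpStatus are reduced to the fields this sketch needs.
type NodeProgress struct {
	NodeID string
	Status string
}

type OpStatus struct {
	Message string
	Nodes   []NodeProgress
}

// mergeStatus returns the status to store. If the incoming status carries
// no Nodes, the previous Nodes are carried forward; a caller that supplies
// its own Nodes replaces the prior slice wholesale (no merge).
func mergeStatus(prev, next *OpStatus) *OpStatus {
	if prev != nil && next != nil && len(next.Nodes) == 0 {
		next.Nodes = prev.Nodes
	}
	return next
}
```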
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(distributed): strip implementation details from user-facing docs
Trim the new install/upgrade timeout rows and the install-progress
sections to focus on what the operator sees and tunes. Drops:
- the NATS subject names and pub/sub mechanics
- "round-trip" / reconciler / backend.list jargon
- /api/operations polling cadence
- "pre-2026-05-22" version references
Reframes the breakdown text around the admin UI (Operations Bar,
chevron, status pills, "Worker busy" tooltip). Implementation context
lives in the agent notes and code comments.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(config): move DistributedConfig.Validate flag names to constants
The negative-duration check map was a wall of literal kebab-case
strings that had to stay in sync with the kong-derived CLI flag names
manually. Move them to a Flag* const block alongside the existing
Default* block so a rename of either the Go field or the CLI naming
convention forces a compile error rather than silent drift.
Sole consumer today is Validate; the constants are exported so future
operator-facing surfaces (e.g. error messages on other validation
paths) can reference them by name instead of repeating the literals.
Tests pin both the literal values (so a future "let's just rename
this" doesn't accidentally regress the CLI flag) and the negative-
duration error message for the new BackendInstall / BackendUpgrade
fields.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(distributed): extract NodeStatus and Phase enums to constants
Sweep for the same literal-string-as-identifier pattern called out on
the Validate flag names: the per-node install status enum
("queued" | "downloading" | "running_on_worker" | "success" | "error")
appeared as raw literals across managers_distributed.go (10+ sites,
including 3 separate `n.Status == "running_on_worker"` checks),
operation.go, and the test suite. Same shape for the Phase enum
("resolving" | "downloading" | "extracting" | "starting") in the
worker-side progress publisher.
Promote both to exported const blocks:
- galleryop.NodeStatus{Queued,Downloading,RunningOnWorker,Success,Error}
shared between galleryop.NodeProgress.Status (the wire field) and
nodes.NodeOpStatus.Status (the in-process per-node summary)
- messaging.Phase{Resolving,Downloading,Extracting,Starting}
shared between the worker publisher and any future consumer that
needs to switch on phase
Tests pin both the literal values (so a future "let's just rename" doesn't
silently change the JSON wire) and use the constants in setup (so the
producer side stays drift-protected). Wire-format assertions on the
/api/operations JSON output keep their literals deliberately, so the
constant value can never silently diverge from what the UI receives.
Out of scope for this PR (separate cleanup): the finetune and
quantization job-status enums have the same anti-pattern with 14+
literal sites each, but predate this PR's work.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(vllm): switch L4T13 backend to PyPI aarch64+cu130 wheels
The L4T13 vllm backend pulled torch / torchvision / torchaudio / vllm from
pypi.jetson-ai-lab.io's sbsa/cu130 mirror via [tool.uv.sources] with no
version pins. That mirror started shipping torch 2.11.0 next to a
vllm-0.20.0+cu130 wheel that was still compiled against torch 2.10's c10
ABI, so uv landed on the mismatched pair and vllm crashed at import:
ImportError: vllm/_C.abi3.so: undefined symbol:
_ZN3c1013MessageLoggerC1EPKciib
(c10::MessageLogger's constructor signature changed between torch 2.10 and
2.11; the vllm wheel referenced the 2.10 form, the installed libc10.so
exported only the 2.11 form.)
Since torch 2.11 (April 2026) PyPI publishes its own aarch64 + cu130
manylinux wheels, and vllm 0.20.0 ships an aarch64 wheel whose Requires-
Dist locks torch==2.11.0 / torchvision==0.26.0 / torchaudio==2.11.0. That
makes uv's resolver produce an ABI-consistent set automatically, so the
mirror and the [tool.uv.sources] pinning are no longer needed.
flash-attn is dropped from the dep list: PyPI has no aarch64 wheel, but
vLLM 0.20+ already bundles its own vllm_flash_attn (fa2 + fa3) inside the
main wheel, so the Dao-AILab package isn't required at runtime.
Reference: https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/
Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] [WebFetch]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(vllm): retire l4t13 pyproject.toml in favor of requirements-*.txt
pyproject.toml only existed because uv pip install -r requirements.txt
doesn't honor [tool.uv.sources]. The previous commit dropped [tool.uv.
sources] (PyPI now serves the aarch64 + cu130 wheels directly), so the
file no longer carries any logic the requirements-*.txt path can't.
Replace with the same two-file pattern every other build profile uses:
- requirements-l4t13.txt (accelerate / torch / transformers /
bitsandbytes - matches cublas13's split)
- requirements-l4t13-after.txt (vllm; runs after the base resolve so
the cu130 torch wheel lands first)
install.sh's whole l4t13 elif branch goes away; libbackend.sh's
installRequirements already handles the requirements-install.txt build-
deps pass, the C_INCLUDE_PATH export for PORTABLE_PYTHON, and the
runProtogen call, so falling through to the standard else branch
produces identical install behavior with less surface area.
No functional change at install time - same wheels, same order.
Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(sglang,vllm-omni): switch L4T13 backends to PyPI aarch64+cu130 wheels
Same root cause and same fix as the vllm backend in the previous commits:
the L4T13 sglang and vllm-omni backends both pulled their accelerator
stack from pypi.jetson-ai-lab.io's sbsa/cu130 mirror with no version
pins, so they would silently land on the same torch 2.11 vs cu130-built
wheel ABI mismatch the moment the mirror published an out-of-sync pair.
sglang
------
- Drop pyproject.toml + [tool.uv.sources]. The historical comment said
the [all] extra was unsafe on aarch64 because of decord, but sglang
0.5.x now uses `decord2` on aarch64/arm/armv7l (which ships cp312
aarch64 wheels), so we can match cublas13's sglang[all]>=0.5.11 pin
and stop being capped at the 0.5.1.post2 the L4T mirror shipped.
That unblocks Gemma 4 / MTP recipes on Jetson Thor.
- New requirements-l4t13.txt mirrors the cublas13 split (accelerate /
torch / torchvision / torchaudio / transformers), requirements-l4t13-
after.txt carries sglang[all]>=0.5.11.
- install.sh's l4t13 elif branch goes away; falls through to the
standard installRequirements path.
vllm-omni
---------
- requirements-l4t13.txt drops --extra-index-url to jetson-ai-lab and
drops flash-attn (PyPI has no aarch64 wheel, vLLM 0.20+ bundles its
own vllm_flash_attn fa2 + fa3 internally).
- install.sh's l4t13 vllm-install branch collapses into the cublas13
branch since both now just run `pip install vllm --torch-backend=auto`
against PyPI.
- --index-strategy=unsafe-best-match is dropped from the top-level
l4t13 guard; without the L4T mirror in the picture it had no purpose.
The from-source vllm-omni install on top still keeps its existing
`sed -i '/^fa3-fwd[[:space:]]*==/d' requirements/cuda.txt` workaround -
fa3-fwd has no aarch64 wheel and no sdist, unrelated to flash-attn.
Reference: https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/
Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] [WebFetch]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(sglang): drop [all] extra on l4t13 - xatlas has no aarch64 wheel
CI revealed that sglang[all]==0.5.12 transitively pulls xatlas via the
[diffusion] sub-extra, and xatlas ships no aarch64 wheel. Its sdist
depends on scikit_build_core without declaring it in build-system.
requires, so under --no-build-isolation uv can't build it from source:
× Failed to build `xatlas==0.0.11`
├─▶ The build backend returned an error
╰─▶ Call to `scikit_build_core.build.build_wheel` failed (exit status: 1)
ModuleNotFoundError: No module named 'scikit_build_core'
help: `xatlas` (v0.0.11) was included because `sglang[all]` (v0.5.12)
depends on `xatlas`
Upstream sglang explicitly gates st_attn and vsa on
`platform_machine != aarch64` inside the same [diffusion] extra but
forgot xatlas - same class of bug that bit the old decord pin.
Use plain `sglang>=0.5.11` on l4t13. backend.py imports only base
sglang.srt symbols (Engine, ServerArgs, FunctionCallParser,
ReasoningParser); the [all] extras are optional accelerators not
required at import time. cublas13 (x86_64) keeps [all] because xatlas
has x86_64 wheels there.
Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Upstream llama.cpp defaults `cache_prompt = true` (common/common.h),
but `parse_options` in the grpc-server backend unconditionally forwards
the proto `PromptCacheAll` field, so any model that didn't set
`prompt_cache_all: true` in its YAML was getting `cache_prompt=false` —
silently overriding llama.cpp's own default. With `kv_unified` and
`cache_idle_slots` already on by default, this was the last piece
preventing the per-request prompt cache from being usable out of the
box.
Make `PromptCacheAll` tristate (`*bool`), default it to `true` in
`SetDefaults`, and dereference at the proto boundary. Users can still
opt out with an explicit `prompt_cache_all: false`. Same pattern as
`MMap`, `MMlock`, `Reranking`, etc.
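A minimal sketch of the tristate pattern. ModelConfig is reduced to one
field here, and promptCacheAllForProto is a hypothetical name for the
dereference at the proto boundary.

```go
package main

// ModelConfig is reduced to the one field this sketch needs. A nil pointer
// means "not set in YAML", which a plain bool cannot express.
type ModelConfig struct {
	PromptCacheAll *bool
}

// SetDefaults fills in true only when the user did not set the field.
func (c *ModelConfig) SetDefaults() {
	if c.PromptCacheAll == nil {
		enabled := true
		c.PromptCacheAll = &enabled
	}
}

// promptCacheAllForProto is a hypothetical helper for the proto boundary:
// it collapses the tristate into the plain bool the proto field expects.
func promptCacheAllForProto(c ModelConfig) bool {
	return c.PromptCacheAll != nil && *c.PromptCacheAll
}
```

With a plain bool, an omitted key and an explicit false are
indistinguishable, which is how the unset case ended up forwarded as false.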
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In distributed mode the local /api/backend-logs WebSocket has nothing
behind it (inference runs on workers), so the "View backend logs" link
in Traces (and the action in Manage, before it was hidden) dead-ended
on /app/backend-logs/<modelId>. Manage worked around it by hiding the
action; Traces still rendered the link.
Make /app/backend-logs/:modelId the single, mode-aware entry point.
A new BackendLogsRouter probes useDistributedMode and forks:
- standalone: existing local WebSocket view (BackendLogsDetail).
- distributed: DistributedBackendLogsResolver fans out to each node
via nodesApi.getModels, filters by model_name, and routes:
* 0 hits -> empty state with a link to the Nodes page.
* 1 hit -> <Navigate replace> to
/app/node-backend-logs/<nodeId>/<modelId>,
preserving the ?from= deep-link timestamp.
* N hits -> picker listing each hosting worker (node id,
replica index, load state) so the operator can
choose which worker's logs to view.
Bare modelId in the redirect target intentionally aggregates that
node's replicas via the worker's BackendLogStore, matching the
existing per-node link pattern in Nodes.jsx.
Revert the per-caller distributed checks now that routing is
centralised: drop the hidden:distributedMode guard on Manage's
Backend logs action, and remove the prop threading in Traces so the
link is unconditional. Any future view that wants to link to backend
logs uses the same URL and gets correct behaviour in both modes.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The trace middleware buffered the full request and response bodies for every
JSON exchange. With a chatty agent-pool RAG workload, /embeddings responses
(large vector arrays) accumulated to tens of MB in the in-memory buffer; the
admin Traces page would then download and parse 40+ MB on every load and on
every 5s auto-refresh, locking the UI in a loading state.
Add LOCALAI_TRACING_MAX_BODY_BYTES (default 64 KiB) that caps each captured
body. The full payload still flows through to the real client; only the
trace copy is bounded. Exchanges record body_truncated and original
body_bytes so the dashboard can show that truncation happened. The cap is
configurable via env, CLI, and runtime_settings.json.
Also unblock recovery: the Traces page now keeps the Clear button enabled
while loading, since "buffer too large to render" is exactly when the user
needs to clear it.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* chore: ignore local .worktrees directory
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(openai): stream usage non-zero when tools are enabled
The streaming chat-completions worker for tool-bearing requests
(processTools in core/http/endpoints/openai/chat.go) never forwarded the
cumulative TokenUsage from ComputeChoices to the chunks it placed on the
responses channel. The outer streaming loop's running usage tracker
therefore stayed at the zero value, and the include_usage trailer
reported {prompt_tokens:0, completion_tokens:0, total_tokens:0} whenever
the request carried a `tools` array. Without tools, the alternative
`process` path stamps Usage on every chunk, so that path was unaffected.
Forward the final TokenUsage via a usage-only sentinel chunk (empty
Choices, populated Usage) emitted right before close(responses). The
outer loop's per-chunk Usage capture moves above the empty-Choices skip
so the sentinel updates the tracker without ever reaching the wire,
keeping the existing OpenAI spec contract (intermediate chunks carry no
`usage` field, and the deferred-final-chunk helpers remain Usage-free
per the regression test for issue #8546).
Adds streamUsageFromTokenUsage, usageSentinelChunk, and
applyChunkToUsage helpers with focused Ginkgo coverage plus a flow-level
test that mirrors the outer-loop sequence.
Fixes #9927
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4-7 [Claude Code]
* refactor(openai): return final TokenUsage from stream workers
Replace the usage-only sentinel SSE chunk introduced in the previous
commit with a plain return value. The streaming workers process and
processTools (now extracted as package-level processStream and
processStreamWithTools) return (backend.TokenUsage, error); the outer
ChatEndpoint loop reads the cumulative counts off the existing `ended`
channel (now carrying streamWorkerResult{usage, err}) and builds the
include_usage trailer from a normal Go value after the LOOP exits.
This drops the empty-Choices "skip but capture Usage" rule from the
outer loop and removes the usageSentinelChunk / applyChunkToUsage
helpers entirely. The SSE responses channel is back to a single
purpose: wire chunks only.
processStream and processStreamWithTools move into chat_stream_workers.go
so they can be exercised directly from tests. The chat_stream_usage_test.go
suite now drives the workers with a mocked backend.ModelInferenceFunc
and asserts on the returned TokenUsage. The regression coverage for
issue #9927 is therefore behavioral: reverting the fix (discarding
ComputeChoices' usage return) makes the assertions fail with concrete
count mismatches.
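The shape of the refactor can be sketched as below. The worker body, the
string chunk type, and runStream are hypothetical simplifications; only the
idea of returning usage through the `ended` channel comes from the commit.

```go
package main

// TokenUsage is a reduced stand-in for backend.TokenUsage.
type TokenUsage struct {
	Prompt     int
	Completion int
}

// streamWorkerResult is what the `ended` channel carries.
type streamWorkerResult struct {
	usage TokenUsage
	err   error
}

// worker is a hypothetical stream worker: it writes wire chunks to
// responses and returns the cumulative usage as a plain Go value.
func worker(responses chan<- string) (TokenUsage, error) {
	defer close(responses)
	responses <- "chunk-1"
	responses <- "chunk-2"
	return TokenUsage{Prompt: 7, Completion: 2}, nil
}

// runStream mirrors the outer loop: drain wire chunks first, then read the
// final usage off `ended` to build the trailer.
func runStream() ([]string, TokenUsage, error) {
	responses := make(chan string)
	ended := make(chan streamWorkerResult, 1)

	go func() {
		usage, err := worker(responses)
		ended <- streamWorkerResult{usage: usage, err: err}
	}()

	var wire []string
	for chunk := range responses {
		wire = append(wire, chunk) // responses carries wire chunks only
	}
	res := <-ended
	return wire, res.usage, res.err
}
```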
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4-7 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The merged feature (#9920) let admins see per-API-key and per-source
totals but did not surface which user owned each key, and lumped
every user's Web UI traffic into a single global Web UI row. This
makes the admin Sources tab properly per-user attributable:
- KeyTotal gains UserID + UserName, populated from the snapshot the
usage middleware already records. The by_key roll-up now groups by
(api_key_id, api_key_name, user_id, user_name).
- New SourceTotals.ByUserSource roll-up groups (source, user_id,
user_name) for sources without a key identity (web, legacy). Only
populated on the admin path (includeLegacy=true); the non-admin
endpoint stays unchanged for backwards compatibility.
- SourcesTable accepts showUserColumn={isAdmin}; admin view renders
a User column, makes the search match user name/id, and expands
Web UI / legacy pseudo-rows from the global aggregate to one row
per user using by_user_source.
Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(galleryop): add TargetNodeID to ManagementOp for single-node installs
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(galleryop): add NodeScopedKey helpers for per-node opcache rows
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(galleryop): use strings.Cut for NodeScopedKey parsing, reject empty nodeID
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(nodes): scope DistributedBackendManager.InstallBackend to single node via TargetNodeID
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(http): make /api/nodes/:id/backends/install async via gallery service job queue
The handler previously called unloader.InstallBackend synchronously and
blocked the browser for up to 3 minutes waiting on the NATS reply. It now
enqueues a TargetNodeID-scoped ManagementOp on BackendGalleryChannel and
returns HTTP 202 + jobID immediately, matching /api/backends/install/:id.
The opcache key is built via NodeScopedKey(nodeID, backend) so concurrent
installs of the same backend across different nodes do not stomp each
other. galleryService/opcache/appConfig are threaded through
RegisterNodeAdminRoutes for this.
Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(http): log malformed backend_galleries override and stop test drain goroutine
Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(api): expose nodeID for node-scoped backend ops in /api/operations
Node-scoped backend installs land in opcache under "node:<nodeID>:<backend>"
keys. Without splitting that prefix back out, the operations panel renders
the full key as the display name and has no structured way to label which
worker an install is targeting. Detect the prefix, surface nodeID as its own
response field, and reduce the display name back to the bare backend slug.
Bare (non-scoped) ops are left untouched so legacy installs do not gain a
misleading empty nodeID.
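
A rough sketch of building and splitting the "node:<nodeID>:<backend>" key format mentioned above. The signatures are assumptions (the real NodeScopedKey helpers may differ), and `ParseNodeScopedKey` is a hypothetical name. A nodeID containing ":" would be ambiguous in this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

const nodePrefix = "node:"

// NodeScopedKey builds "node:<nodeID>:<backend>". An empty nodeID is rejected
// so a scoped key can never collapse into something that looks like a bare one.
func NodeScopedKey(nodeID, backend string) (string, error) {
	if nodeID == "" {
		return "", errors.New("empty nodeID")
	}
	return nodePrefix + nodeID + ":" + backend, nil
}

// ParseNodeScopedKey splits a scoped key back into its parts. Bare keys are
// returned unchanged with ok=false, so legacy ops gain no empty nodeID.
func ParseNodeScopedKey(key string) (nodeID, backend string, ok bool) {
	rest, found := strings.CutPrefix(key, nodePrefix)
	if !found {
		return "", key, false
	}
	nodeID, backend, found = strings.Cut(rest, ":")
	if !found || nodeID == "" {
		return "", key, false
	}
	return nodeID, backend, true
}

func main() {
	key, _ := NodeScopedKey("worker-1", "llama-cpp")
	fmt.Println(key)
	fmt.Println(ParseNodeScopedKey(key))
	fmt.Println(ParseNodeScopedKey("llama-cpp"))
}
```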
Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(react-ui): poll job status for node-targeted backend installs
Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(react-ui): make NodeInstallPicker state updates pure and surface cancellations as errors
Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(react-ui): clarify async semantics in handleInstallOnTarget
Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(http): use statusUrl casing for node install response to match codebase precedent
Assisted-by: Claude:opus-4-7 [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The existing master push pipeline produces `master` (rolling) and
`sha-<short>` tags. Neither is orderable by build time, so downstream
GitOps that want to auto-bump to the newest master build (e.g. Flux
ImagePolicy) can't pick the latest from the tag list — alphabetical
sort over hex shas is effectively random, and the rolling `master`
tag can't be referenced as an immutable bump target.
Add a third tag of the form `master-<epoch>-<sha>` (Unix epoch in
seconds + short sha), gated on default-branch pushes via metadata-
action's `is_default_branch` predicate. The sha is retained for
traceability; the epoch makes the tags numerically orderable, so a
Flux ImagePolicy like
    filterTags:
      pattern: '^master-(?P<ts>[0-9]+)-[a-f0-9]+$'
      extract: '$ts'
    policy:
      numerical:
        order: asc
will reliably bump to the newest master build.
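
To illustrate why the epoch makes the tags orderable, here is a small sketch that applies the same pattern to a tag list and picks the numerically largest timestamp. The function name `newestMasterTag` and the sample tags are made up for the example; this is not how Flux itself is implemented.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Same pattern as the ImagePolicy above; Go's regexp accepts (?P<name>...).
var masterTag = regexp.MustCompile(`^master-(?P<ts>[0-9]+)-[a-f0-9]+$`)

// newestMasterTag returns the matching tag with the largest epoch value and
// ignores tags such as "master" or "sha-<short>" that do not match.
func newestMasterTag(tags []string) (string, bool) {
	best, bestTS, found := "", int64(-1), false
	for _, t := range tags {
		m := masterTag.FindStringSubmatch(t)
		if m == nil {
			continue
		}
		ts, err := strconv.ParseInt(m[masterTag.SubexpIndex("ts")], 10, 64)
		if err != nil {
			continue
		}
		if ts > bestTS {
			best, bestTS, found = t, ts, true
		}
	}
	return best, found
}

func main() {
	tags := []string{
		"master",
		"sha-abc1234",
		"master-1747000000-fe12ab3",
		"master-1747000500-0a1b2c3",
	}
	fmt.Println(newestMasterTag(tags))
}
```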
Applied to both image_build.yml (OCI labels stay consistent) and
image_merge.yml (the actual tag publisher via buildx imagetools).
utils: fail immediately on extraction errors
Setting ContinueOnError to false ensures that ExtractArchive does not
leave the model or backend directory in an inconsistent state if a
partial failure occurs. This improves robustness against malformed
archives or unexpected I/O issues during installation.
Signed-off-by: RinZ27 <222222878+RinZ27@users.noreply.github.com>
* feat(usage): add Source, APIKeyID, APIKeyName columns to UsageRecord
Adds three additive columns plus UsageSource* constants. The columns
are auto-migrated by InitDB. APIKeyID is a nullable foreign reference
to UserAPIKey.ID; APIKeyName is snapshotted on each row so revoked
keys keep showing their name in history.
Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(usage): backfill Source on pre-feature usage rows
InitDB now classifies any pre-existing usage_record with an empty
source: 'legacy-api-key' user -> legacy, everything else -> web.
The backfill is idempotent (only touches NULL/empty rows).
Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(usage): add GetUserUsageBySource aggregator
Groups by (bucket, source, api_key_id, api_key_name). Filters out
legacy by default. Returns both per-bucket detail and roll-ups
(by_source, by_key sorted desc and capped at 200, grand_total).
The MAX(created_at) projection is iterated via Rows().Scan into a
string column and parsed manually because the SQLite driver surfaces
the aggregated timestamp as a string, which database/sql refuses to
scan directly into time.Time. Postgres returns a real timestamp; the
same string path handles its RFC3339 form too.
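
A sketch of the manual parsing step, assuming the timestamp arrives as a string. The layout list and the name `parseLastUsed` are assumptions for illustration; the formats a given driver actually emits should be checked against real rows.

```go
package main

import (
	"fmt"
	"time"
)

// Candidate layouts tried in order. These are guesses at common encodings:
// RFC3339, and a space-separated form with or without a zone offset.
var lastUsedLayouts = []string{
	time.RFC3339Nano,
	"2006-01-02 15:04:05.999999999-07:00",
	"2006-01-02 15:04:05.999999999",
}

// parseLastUsed returns the parsed time in UTC, or ok=false when no layout
// matches (the caller can then log the miss instead of dropping it silently).
func parseLastUsed(s string) (time.Time, bool) {
	for _, layout := range lastUsedLayouts {
		if t, err := time.Parse(layout, s); err == nil {
			return t.UTC(), true
		}
	}
	return time.Time{}, false
}

func main() {
	for _, s := range []string{
		"2026-05-16T10:00:00Z",
		"2026-05-16 10:00:00.5+02:00",
		"not a time",
	} {
		t, ok := parseLastUsed(s)
		fmt.Println(s, t, ok)
	}
}
```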
Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(usage): log Rows() errors and assert LastUsed in tests
Adds rows.Err() and Rows() open-failure logging in
computeSourceTotals so silent data drops surface in logs. Logs on
parseLastUsedString format misses for the same reason. Strengthens
the snapshot-survival test to assert LastUsed is a recent timestamp,
locking the SQLite time-string parser behaviour.
Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(usage): add admin GetAllUsageBySource with filters and truncation
Optional user_id and api_key_id filters (composed with AND). Legacy
bucket is included for admin callers. truncated=true when more than
200 distinct keys would be in the by_key roll-up.
Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(auth): plumb auth_source and auth_apikey through Echo context
tryAuthenticate now sets auth_source on every successful branch
(web for session/Bearer-session, apikey for Bearer-key/x-api-key/
token-cookie, legacy for legacy env key match). For named-key
branches it also stores the resolved *UserAPIKey under auth_apikey
so downstream middlewares can snapshot id+name without re-validating.
Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(auth): expand tryAuthenticate godoc and cover Bearer-session branch
Documents all three context-key side effects (auth_source,
auth_apikey, _auth_session) plus the split of responsibilities with
the parent Middleware. Adds a test for the Bearer-as-session-token
classification so future regressions there fail loudly.
Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(usage): UsageMiddleware records source + snapshots key name
Reads auth_source and auth_apikey from the Echo context (set by
auth.Middleware in the previous commit). Snapshots UserAPIKey.ID and
Name onto each row so revoked keys remain readable in history.
Falls back to source=web when no auth_source is set (auth disabled
or unrecognised path).
Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(usage): add /api/auth/usage/sources and admin variant
Self endpoint filters legacy server-side; admin endpoint includes
legacy and accepts user_id + api_key_id filters. Response includes
buckets, totals.{by_source, by_key, grand_total}, and a truncated
flag set when the per-key roll-up was capped at 200.
Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(routes): mark test mirror handlers as keep-in-sync with production
The newTestAuthApp helper duplicates production route handlers
inline because it cannot use RegisterAuthRoutes (which requires a
*application.Application). Naming the source path on each mirror
makes the drift contract explicit for future maintainers.
Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): add usageApi.getMySources/getAdminSources + i18n strings
Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): add Sources tab skeleton with data fetch
Adds Usage page tab that fetches /api/auth/usage/sources (or the
admin variant). Renders raw totals plus a placeholder key list;
real visualisations land in subsequent commits. Restructures the
existing tab button block so Models and Sources are visible to
non-admins (Users remains admin-only).
Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): source mix ribbon + searchable/sortable sources table
Replaces the SourcesTab placeholder rendering with two reusable
components: SourceMixRibbon (one segmented bar per source class)
and SourcesTable (search + sort + revoked-key dim). Pulls the
current API key list to detect revoked keys.
Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ui): skip revoked-key detection until the key list is known
existingKeyIds defaulted to an empty Set, which made every live
api_key row render as (revoked) during the brief window before
apiKeysApi.list() resolved, and permanently after a fetch failure.
Use null as the unknown state and suppress the revoked badge until
the parent provides a real Set.
Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(ui): top-N stacked time chart and drill-in chip for Sources tab
Top 7 sources by total tokens get distinct colours; the rest roll up
into 'Other'. Clicking a row in the SourcesTable dims everything
except that series in the chart; the chip is the canonical clear.
Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(usage): document per-API-key Sources tab and endpoints
Extends features/authentication.md Usage Tracking section with:
- A 'Sources' tab description and source-class taxonomy
- Endpoint documentation for /api/auth/usage/sources and the
admin variant
- Response shape example with by_source / by_key / grand_total
- Migration note about pre-feature row backfill
Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(usage): silence errcheck on deferred rows.Close
CI errcheck flagged the bare 'defer rows.Close()' in
computeSourceTotals. Wrap in a closure that discards the close
error explicitly; an error here is non-actionable since we have
already drained the rows and logged any iteration failure.
Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(usage): bound batcher intake and add Shutdown/FlushNow hooks
The pre-existing usage batcher had no cap on its add() path; the
usageMaxPending=5000 constant only guarded the re-queue path after
a failed write, leaving memory growth unbounded if the DB fell
behind. This commit:
- Adds the cap to add() so saturation drops new records (rate-limited
warn at 1/1024) instead of growing unbounded.
- Raises usageMaxPending to 50000 to absorb realistic inference bursts.
- Replaces the package-level batcher global with a mutex-guarded pair
plus a currentBatcher() accessor so Init / Shutdown cycles are
race-free.
- Adds ShutdownUsageRecorder() for graceful drain on process exit
(not yet wired into app shutdown, just published).
- Adds FlushNow() for deterministic tests; the middleware suite no
longer needs 6s sleeps per spec and now runs in ~50ms instead of 18s.
- Re-queue on failed flush is now cap-aware: prepends as much of the
failed batch as fits alongside concurrent arrivals, instead of
dropping the whole batch when full.
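
The intake cap and the cap-aware re-queue can be sketched roughly as below. The type and method names are hypothetical, the cap is tiny for readability, and keeping the oldest part of a failed batch (rather than the newest) is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

type record struct{ ID int }

// batcher is a hypothetical, trimmed-down version of a bounded usage queue.
type batcher struct {
	mu         sync.Mutex
	pending    []record
	maxPending int
	dropped    int
}

// add enqueues r unless the queue is at capacity; it reports false on a drop.
func (b *batcher) add(r record) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.pending) >= b.maxPending {
		b.dropped++
		return false
	}
	b.pending = append(b.pending, r)
	return true
}

// take removes and returns everything currently pending (one flush batch).
func (b *batcher) take() []record {
	b.mu.Lock()
	defer b.mu.Unlock()
	batch := b.pending
	b.pending = nil
	return batch
}

// requeue puts a failed batch back in front of newer arrivals, keeping only
// as many of its records as still fit under the cap.
func (b *batcher) requeue(failed []record) {
	b.mu.Lock()
	defer b.mu.Unlock()
	room := b.maxPending - len(b.pending)
	if room <= 0 {
		b.dropped += len(failed)
		return
	}
	if len(failed) > room {
		b.dropped += len(failed) - room
		failed = failed[:room]
	}
	b.pending = append(append([]record{}, failed...), b.pending...)
}

func main() {
	b := &batcher{maxPending: 3}
	for i := 1; i <= 4; i++ {
		fmt.Println("add", i, b.add(record{ID: i}))
	}
	batch := b.take()
	b.add(record{ID: 5}) // arrives while the flush is failing
	b.requeue(batch)
	fmt.Println(b.pending, "dropped:", b.dropped)
}
```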
Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(usage): drain usage batcher on graceful shutdown
Registers ShutdownUsageRecorder with the existing
signals.RegisterGracefulTerminationHandler so SIGINT/SIGTERM
synchronously flushes any in-memory usage records before the
process exits. Without this, up to one flush interval (5s) of
recorded usage was lost when LocalAI restarted.
Refs: #9862
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Aligns LocalAI's llama-cpp gRPC backend with upstream's auto-on prompt
cache path so repeated system prompts (agents, OpenAI/Anthropic-compatible
CLIs, coding assistants) skip prefill on subsequent calls without any
YAML changes. Reported in #9921.
Upstream's server enables `kv_unified=true` (and bumps `n_parallel` to 4)
when slot count is auto, which unlocks `cache_idle_slots`. LocalAI
hardcodes `n_parallel=1` and so far also hardcoded `kv_unified=false`,
which silently force-disables idle-slot saving at server init. The host
prompt cache was allocated but never written across requests.
Changes in backend/cpp/llama-cpp/grpc-server.cpp:
- params.kv_unified: false -> true (single-slot path now benefits from
the prompt cache; users can opt out with `kv_unified:false`)
- params.n_ctx_checkpoints: 8 -> 32 (match upstream default)
- params.cache_idle_slots = true initialized explicitly (upstream default)
- params.checkpoint_every_nt = 8192 initialized explicitly (upstream default)
- New option parsers: cache_idle_slots / idle_slots_cache,
checkpoint_every_nt / checkpoint_every_n_tokens
Docs:
- features/text-generation.md: fix misleading `cache_ram` description
(it's the host-side prompt cache, not the KV cache), document the
kv_unified + cache_ram + cache_idle_slots interaction, add rows for
the two newly-exposed options, and add a worked example for the
agent/CLI workload from the issue.
- advanced/model-configuration.md: mark the legacy `prompt_cache_path`
/ `prompt_cache_all` / `prompt_cache_ro` YAML fields as unused by the
llama-cpp gRPC backend (they target upstream's CLI completion tool
and are not consumed by grpc-server.cpp) and point readers at the
new prompt-cache explainer.
Closes #9921
Assisted-by: claude:opus-4.7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
refactor(agents): bump skillserver, drop redundant Name from list_skills/search_skills
skillserver's list_skills MCP tool used to ship every entry with name=""
(field was commented out), while search_skills populated it - two tools
with inconsistent shape for the same data. skill.Name and skill.ID are
populated from the same source string anyway (the directory name), so
returning both was pure duplication.
Bumps github.com/mudler/skillserver to a7317cb, which drops the Name
field from both SkillInfo and SearchResult and leaves ID as the single
canonical identifier (already what read_skill consumes).
Adds core/services/skills/skills_mcp_test.go, a regression that drives
the LocalAI FilesystemManager through an in-process MCP session and
asserts a newly-created skill is visible by ID on the still-open session.
This is a cleanup, not the root cause of #9868 - the reporter likely
sees something deeper than a cosmetic JSON shape issue.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
llama.cpp's model loader asserts back().pattern == nullptr on
params.tensor_buft_overrides (and on params.kv_overrides.back().key[0]
== 0) before binding them into llama_model_params. PR #8560 attempted
to satisfy llama_params_fit's placeholder requirement by pre-filling
params.tensor_buft_overrides up to llama_max_tensor_buft_overrides()
*before* the option-parse loop. Any subsequent push_back from
override_tensor / draft_cpu_moe / draft_n_cpu_moe / draft_override_tensor
then appended real entries after the placeholders, leaving back() with
a real pattern and tripping the assert. The draft override vector
likewise had no terminator at all.
Mirror upstream common/arg.cpp:645-658 instead: real entries are
pushed during option parsing, and after parsing we pad the main vector
up to ntbo (placeholders land at the end, so back() is always nullptr)
and append a single {nullptr, nullptr} to the draft vector when it is
non-empty. The existing kv_overrides terminator block already matches
upstream and stays.
Verified against ggml-org/llama.cpp@5cbaa5e: only tensor_buft_overrides
(main + draft) and kv_overrides are sentinel-terminated common_params
fields; everything else is size-driven std::vector.
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
useOperations() was calling setOperations() with a fresh array on every
1s poll, even when the payload was identical. In React 19 the DOM diff
no longer short-circuits dangerouslySetInnerHTML on equal __html, so the
forced Chat re-render re-assigned innerHTML on every assistant message
once per second — wiping any text the user had selected.
Skip the state update when the serialised operations payload is
unchanged, and switch loading/error to functional setters so they also
short-circuit at the source.
Also fixes the chat copy button on plain HTTP: navigator.clipboard is
undefined in non-secure contexts (a common LXC+Docker deployment), but
the previous code called it unconditionally and showed a success toast
regardless. Routed Chat, AgentChat and CanvasPanel through a new
copyToClipboard() helper that uses navigator.clipboard when available
and falls back to a hidden-textarea + execCommand('copy') trick that
browsers still honour outside secure contexts. The fallback preserves
the user's existing selection.
Regression coverage in e2e/chat-polling-selection.spec.js: a
MutationObserver counts mutations on the assistant content node across
3s of polling (must be 0); the copy test stubs out navigator.clipboard
and asserts that execCommand('copy') is invoked.
Assisted-by: claude-opus-4-7-1m
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The new ace-step.cpp revision moves backend initialization inside each
`*_load` call and drops the separate `DiTGGMLConfig` argument from
`dit_ggml_load` (config now lives in `DiTGGML::cfg`, populated from GGUF
metadata at load time). Drop the now-removed `*_init_backend` calls and
replace `g_dit_cfg` accesses with `g_dit.cfg`.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Adapt the C++ wrapper to the new `generate_video()` signature: upstream now
returns `bool` and writes frames/audio via out-parameters (`sd_image_t**`,
`sd_audio_t**`). Also set `p->fps` on the params struct (new upstream field)
and free the returned audio handle on both the success and error paths.
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Non-image/non-audio file attachments (txt, md, csv, json) were being
stored in the 'files' metadata field but never added to the message
content array sent to /v1/chat/completions. Images and audio correctly
received content blocks; files did not.
Fix: push a text content block into messageContent when textContent is
present, matching the pattern used for image_url and audio_url.
Also fixes Home.jsx addFiles which never called file.text() at all,
meaning files attached on the home screen had empty textContent even
before reaching useChat.js.
Note: PDF files use file.text() which returns raw bytes rather than
parsed text. Proper PDF support would require PDF.js or server-side
extraction and is not part of this fix.
Signed-off-by: Daniel Liljeberg <damien_@hotmail.com>
The flake set `src = ./sources;` referencing a non-existent subdirectory,
so `nix build` and `nix develop` both failed evaluation. Point `src` at
the repo root and refresh `vendorHash` accordingly.
Add `devShells.default` with the Go toolchain, protobuf generators,
Node.js/bun for the React UI (`make react-ui`), and the linters used by
`make lint` (golangci-lint, gofumpt, goimports, staticcheck).
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(gallery): verify backend OCI images with keyless cosign
Close a trust gap where a registry compromise or MITM could silently
replace a backend image: the gallery YAML tells LocalAI which image to
pull, but until now nothing verified the bytes came from our CI.
Consumer (pkg/oci/cosignverify):
- New package using sigstore-go to verify keyless-cosign signatures.
- OCI 1.1 referrers API + new bundle format (no legacy :tag.sig).
- Policy fields: Issuer / IssuerRegex / Identity / IdentityRegex /
NotBefore. NotBefore is the revocation lever — keyless Fulcio certs
are ephemeral so revocation is policy-side; advancing not_before in
the gallery YAML invalidates every signature predating the cutoff.
- TUF trusted root cached process-wide so N backends from one gallery
do 1 fetch, not N.
Plumbing:
- pkg/downloader: ImageVerifier interface + WithImageVerifier option
threaded through DownloadFileWithContext. Verification runs between
oci.GetImage and oci.ExtractOCIImage, with digest pinning via
pinnedImageRef to close the TOCTOU window. Skips the verifier's HEAD
when the ref is already digest-pinned.
- core/config: Gallery.Verification YAML block.
- core/gallery: backendDownloadOptions builds the verifier from the
policy; applied on initial URI, mirrors, and tag fallbacks.
- core/gallery/upgrade: the upgrade path now routes through the same
options builder. A regression Ginkgo spec pins this contract —
without it, UpgradeBackend silently bypassed verification.
- core/cli: --require-backend-integrity (LOCALAI_REQUIRE_BACKEND_INTEGRITY)
escalates missing policy / empty SHA256 from warn to hard-fail.
Producer (.github/workflows/backend_merge.yml):
- id-token: write at job scope (PR-fork-safe via existing event gate).
- sigstore/cosign-installer@v3 pinned to v2.4.1.
- After each docker buildx imagetools create, resolve the manifest
list digest and run cosign sign --recursive --new-bundle-format
--registry-referrers-mode=oci-1-1 against repo@digest. --recursive
signs the index and every per-arch entry, matching how the consumer
resolves a tag to a platform-specific manifest before verifying.
Rollout: backend/index.yaml has no `verification:` block yet, so this
PR is backward-compatible — installs proceed with a warning until the
gallery is populated. Strict mode is opt-in.
Assisted-by: claude-code:claude-opus-4-7 [Bash] [Edit] [Read] [Write] [WebSearch] [WebFetch]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* refactor(gallery): plumb RequireBackendIntegrity through config instead of env
The previous implementation re-exported the --require-backend-integrity
CLI flag into LOCALAI_REQUIRE_BACKEND_INTEGRITY via os.Setenv, then
re-read it in core/gallery via os.Getenv. This leaked process state
into the gallery package and made the flag impossible to override
per-call or test without touching the env.
Add RequireBackendIntegrity to ApplicationConfig (with a matching
WithRequireBackendIntegrity AppOption) and thread the bool through
every install/upgrade path: InstallBackend, InstallBackendFromGallery,
UpgradeBackend, InstallModelFromGallery, InstallExternalBackend,
ApplyGalleryFromString/File, startup.InstallModels. Worker subcommands
gain the same env-bound flag on WorkerFlags so distributed-worker
installs honor it consistently with the worker daemon path.
Add a forbidigo lint rule against os.Getenv / os.LookupEnv / os.Environ
to keep the env-leak pattern from creeping back. Existing offenders
(p2p, config loaders, etc.) are baseline-grandfathered by the existing
new-from-merge-base: origin/master setting; targeted path exclusions
cover the legitimate cases — kong CLI entry points, backend
subprocesses, system capability probes, gRPC AUTH_TOKEN inheritance,
test gating env vars.
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(llama-cpp): bump to MTP-merge SHA and document draft-mtp spec type
Update LLAMA_VERSION to 0253fb21 (post ggml-org/llama.cpp#22673 merge,
2026-05-16) to pick up Multi-Token Prediction support.
No grpc-server.cpp changes are required: the existing `spec_type` option
delegates to upstream's `common_speculative_types_from_names()`, which
already accepts the new `draft-mtp` name. The `n_rs_seq` cparam needed
by MTP is auto-derived inside `common_context_params_to_llama` from
`params.speculative.need_n_rs_seq()`, and when no `draft_model` is set
the upstream server builds the MTP context off the target model itself.
Docs: extend the speculative-decoding section of the model-configuration
guide with the new type, both load paths (MTP head embedded in the main
GGUF vs. separate `mtp-*.gguf` sibling), the PR's recommended
`spec_n_max:2-3`, and the chained `draft-mtp,ngram-mod` recipe. Also
notes that the upstream `-hf` auto-discovery of `mtp-*.gguf` siblings is
not wired through LocalAI's gRPC layer.
Agent guide: short note explaining that new upstream spec types are
picked up automatically and that MTP needs no gRPC plumbing.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama-cpp): auto-detect MTP heads and enable draft-mtp on import + load
Detect upstream's `<arch>.nextn_predict_layers` GGUF metadata key (set by
`convert_hf_to_gguf.py` for Qwen3.5/3.6 family models and similar) and,
when present and the user has not configured a `spec_type` explicitly,
auto-append the upstream-recommended speculative-decoding tuple:
- spec_type:draft-mtp
- spec_n_max:6
- spec_p_min:0.75
The 0.75 p_min is pinned defensively because upstream marks the current
default with a "change to 0.0f" TODO; locking it here keeps acceptance
thresholds stable across future llama.cpp bumps.
Detection runs in two places:
- The model importer (`POST /models/import-uri`, the `/import-model`
UI) range-fetches the GGUF header for HuggingFace / direct-URL
imports via `gguf.ParseGGUFFileRemote`, with a 30s timeout and
non-fatal error handling. OCI/Ollama URIs are skipped because the
artifact is not directly streamable; the load-time hook covers them
once the file is on disk.
- The llama-cpp load-time hook (`guessGGUFFromFile`) reads the local
header on every model start and appends the same options if
`spec_type` is not already set.
Both paths share `ApplyMTPDefaults` and respect an explicit user-set
`spec_type:` / `speculative_type:` so YAML overrides win. Ginkgo
specs cover the append, preserve-user-choice, legacy alias, and nil
safety paths.
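
A simplified sketch of the defaulting rule described above. The signature is an assumption (the real `ApplyMTPDefaults` likely operates on a config struct), and options are modelled here as plain "key:value" strings.

```go
package main

import (
	"fmt"
	"strings"
)

// applyMTPDefaults appends the MTP speculative-decoding options when the GGUF
// reports MTP layers and the user has not chosen a spec type explicitly.
// nextnPredictLayers stands for the <arch>.nextn_predict_layers metadata value.
func applyMTPDefaults(options []string, nextnPredictLayers int) []string {
	if nextnPredictLayers <= 0 {
		return options
	}
	for _, o := range options {
		key, _, _ := strings.Cut(o, ":")
		switch strings.TrimSpace(key) {
		case "spec_type", "speculative_type":
			return options // an explicit user choice wins
		}
	}
	// The full slice expression forces a copy so the caller's backing array
	// is never written to.
	return append(options[:len(options):len(options)],
		"spec_type:draft-mtp", "spec_n_max:6", "spec_p_min:0.75")
}

func main() {
	fmt.Println(applyMTPDefaults([]string{"use_jinja:true"}, 1))
	fmt.Println(applyMTPDefaults([]string{"speculative_type:ngram-mod"}, 1))
	fmt.Println(applyMTPDefaults(nil, 0))
}
```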
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(importer): resolve huggingface:// URIs before MTP header probe
`gguf.ParseGGUFFileRemote` only speaks HTTP(S), but the importer was
handing it the raw `huggingface://...` URI directly (and similarly for
any other custom downloader scheme). Live-test against
`huggingface://ggml-org/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-MTP-Q8_0.gguf`
exposed this: the probe failed with `unsupported protocol scheme
"huggingface"`, was caught by the non-fatal error path, and the MTP
options were silently never applied to the generated YAML.
Route every candidate URI through `downloader.URI.ResolveURL()` and
require the resolved form to be HTTP(S). After the fix the probe
successfully reads `<arch>.nextn_predict_layers=1` from the real HF
GGUF and the emitted ConfigFile carries spec_type:draft-mtp,
spec_n_max:6, spec_p_min:0.75 as intended.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(ollama): accept float-encoded integer options (num_ctx, top_k, ...)
Home Assistant's Ollama integration encodes integer options as JSON
floats (e.g. `"num_ctx": 8192.0`). Stdlib `json.Unmarshal` refuses to
decode a number with fractional notation into an `int` field, so the
entire request was rejected with HTTP 400 before reaching the backend:
    Unmarshal type error: expected=int, got=number 8192.0,
    field=options.num_ctx
Add a custom `UnmarshalJSON` on `OllamaOptions` that routes the int
fields (`top_k`, `num_predict`, `seed`, `repeat_last_n`, `num_ctx`)
through `*json.Number`, then converts via `Int64()` with a `Float64()`
fallback. Public field types are unchanged, so endpoint code is
untouched. Float fields and `stop` continue to parse via the default
path.
Fixes #9837
Assisted-by: Claude Code:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Out-of-bounds read in SmartypantsRenderer.smartLeftAngle (CWE-125,
CVSS 7.5). Reachable transitively via LocalAGI's Email connector,
which renders inbound HTML email replies using html.CommonFlags
(includes Smartypants). An unmatched `<` in the inbound body could
panic the agent service.
Bump to v0.0.0-20260411013819-759bbc3e3207 (contains the fix). The
klauspost/compress entry loses its `// indirect` tag because
go mod tidy noticed pkg/utils/untar.go imports it directly.
Assisted-by: Claude:claude-opus-4-7 [Claude-Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* realtime: honor output_modalities to skip TTS in text-only mode
The emulated realtime pipeline previously ignored the OpenAI Realtime spec
field output_modalities and always synthesized TTS. Add resolveOutputModalities
+ modalitiesContainAudio helpers and gate the TTS / ResponseOutputAudio*
emission so a client requesting ["text"] gets only ResponseOutputText* events.
This lets thin clients (e.g. thing5-poc) cache TTS on the client side while
still using the realtime WS for VAD + STT + LLM + tool-call parsing.
Assisted-by: Claude:claude-opus-4-7
* realtime: plumb response-level output_modalities and echo on session
Follow-up to the previous commit:
- Resolve response.create's output_modalities at the gate so a per-response
override of an audio session is honored (the test asserted this contract
but the production call site was passing nil).
- Mirror OutputModalities in the RealtimeSession echo so session.update
round-trips the client-supplied value, matching MaxOutputTokens's pattern.
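
One way the two helpers could fit together, as a sketch only. The signatures are guesses, and defaulting to audio when neither level specifies modalities is an assumption chosen to match the earlier always-synthesize behaviour.

```go
package main

import "fmt"

// resolveOutputModalities prefers the per-response value, then the session
// value, and otherwise falls back to audio (an assumed default).
func resolveOutputModalities(session, response []string) []string {
	if len(response) > 0 {
		return response
	}
	if len(session) > 0 {
		return session
	}
	return []string{"audio"}
}

// modalitiesContainAudio reports whether TTS output should be produced.
func modalitiesContainAudio(modalities []string) bool {
	for _, m := range modalities {
		if m == "audio" {
			return true
		}
	}
	return false
}

func main() {
	// An audio session with a text-only override on one response.
	m := resolveOutputModalities([]string{"audio"}, []string{"text"})
	fmt.Println(m, modalitiesContainAudio(m))
}
```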
Assisted-by: Claude:claude-opus-4-7
* realtime: silence errcheck on deferred os.Remove of TTS file
CI's errcheck flagged the pre-existing `defer os.Remove(audioFilePath)`
inside the audio-emission block (now wrapped by the modality gate). Wrap
the call in a closure that explicitly discards the error — the canonical
Go pattern for "I want to defer a cleanup whose error I genuinely don't
care about."
Assisted-by: Claude:claude-opus-4-7 golangci-lint
---------
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The Ollama /api/tags handler passes a nil filter to galleryop.ListModels.
When ModelsPath contains any non-skipped loose file the function then
calls filter(name, nil) and panics, which Echo surfaces to clients as
"Server disconnected without sending a response" - the exact failure
Home Assistant's Ollama integration reports against LocalAI.
Mirror the nil guard already present in
ModelConfigLoader.GetModelConfigsByFilter so every caller is safe, and
add a regression test that exercises the loose-file path with a nil
filter.
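A sketch of the guard, assuming a simplified filter type. The real galleryop.ListModels signature and config type are not shown in this message, so the types below are stand-ins.

```go
package main

import "fmt"

// modelFilter is a stand-in for the real filter type (assumption).
type modelFilter func(name string, cfg *struct{}) bool

// listLooseFiles mimics the loose-file branch: a nil filter is swapped
// for an accept-everything filter before it is ever invoked.
func listLooseFiles(names []string, filter modelFilter) []string {
	if filter == nil {
		filter = func(string, *struct{}) bool { return true }
	}
	var out []string
	for _, n := range names {
		if filter(n, nil) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	// Without the guard, this call would panic on the first loose file.
	fmt.Println(listLooseFiles([]string{"a.gguf", "b.gguf"}, nil))
}
```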
Assisted-by: claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(streaming): comply with OpenAI usage / stream_options spec (#8546)
LocalAI emitted `"usage":{"prompt_tokens":0,...}` on every streamed
chunk because `OpenAIResponse.Usage` was a value type without
`omitempty`. The official OpenAI Node SDK and its consumers
(continuedev/continue, Kilo Code, Roo Code, Zed, IntelliJ Continue)
filter on a truthy `result.usage` to detect the trailing usage chunk;
LocalAI's zero-but-non-null usage on every intermediate chunk made
that filter swallow every content chunk and surface an empty chat
response while the server log looked successful.
Changes:
- `core/schema/openai.go`: `Usage *OpenAIUsage \`json:"usage,omitempty"\``
so intermediate chunks no longer carry a `usage` key. Add
`OpenAIRequest.StreamOptions` with `include_usage` to mirror OpenAI's
request field.
- `core/http/endpoints/openai/chat.go` and `completion.go`: keep using
the `Usage` struct field as an in-process channel for the running
cumulative, but strip it before JSON marshalling. When the request
set `stream_options.include_usage: true`, emit a dedicated trailing
chunk with `"choices": []` and the populated usage (matching the
OpenAI spec and llama.cpp's server behavior).
- `chat_emit.go`: new `streamUsageTrailerJSON` helper; drop the
`usage` parameter from `buildNoActionFinalChunks` since chunks no
longer carry usage.
- Update `image.go`, `inpainting.go`, `edit.go` to wrap their Usage
values with `&` for the new pointer field.
- UI: send `stream_options:{include_usage:true}` from the React
(`useChat.js`) and legacy (`static/chat.js`) chat clients so the
token-count badge keeps populating now that the server is
spec-compliant.
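The effect of the pointer-plus-omitempty change can be sketched as follows. The struct and helper names are simplified stand-ins for the types in core/schema/openai.go, not the real definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type usage struct {
	PromptTokens     int `json:"prompt_tokens"`
	CompletionTokens int `json:"completion_tokens"`
	TotalTokens      int `json:"total_tokens"`
}

type delta struct {
	Content string `json:"content"`
}

type choice struct {
	Index int   `json:"index"`
	Delta delta `json:"delta"`
}

// chunk carries usage as a pointer so a nil value drops the key entirely.
type chunk struct {
	Choices []choice `json:"choices"`
	Usage   *usage   `json:"usage,omitempty"`
}

func mustJSON(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

// intermediateChunk has content and no usage key.
func intermediateChunk(text string) string {
	return mustJSON(chunk{Choices: []choice{{Index: 0, Delta: delta{Content: text}}}})
}

// usageTrailer has an empty (non-nil) choices array and populated usage.
func usageTrailer(u usage) string {
	return mustJSON(chunk{Choices: []choice{}, Usage: &u})
}

func main() {
	fmt.Println(intermediateChunk("hi"))
	fmt.Println(usageTrailer(usage{PromptTokens: 3, CompletionTokens: 2, TotalTokens: 5}))
}
```

A client that filters on a truthy `usage` now sees it only on the trailer, so content chunks pass through.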
Tests:
- New `chat_stream_usage_test.go` pins the spec invariants:
intermediate chunks have no `usage` key, the trailer JSON has
`"choices":[]` and a populated `usage`, and `OpenAIRequest` parses
`stream_options.include_usage`.
- Update `chat_emit_test.go` to reflect that finals no longer embed
usage.
Verified against the live LocalAI instance: before the fix Continue's
filter logic swallowed 16/16 token chunks; with the new shape it
yields 4/5 and routes usage through the dedicated trailer chunk.
Fixes #8546
Assisted-by: Claude:opus-4.7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(streaming): silence errcheck on usage trailer Fprintf
The new spec-compliant `stream_options.include_usage` trailer writes
were flagged by errcheck since they're new code (golangci-lint runs
new-from-merge-base on master); the surrounding `fmt.Fprintf` data:
writes are grandfathered. Drop the return values explicitly to match
the linter's contract without adding a nolint shim.
Assisted-by: Claude:opus-4.7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
The llama.cpp backend already accepts a free-form options: array in the
model config that maps to common_params fields, but a coverage audit
against upstream pin 7f3f843c flagged 12 user-visible knobs that were
neither set via the typed proto fields nor reachable via options:.
Wire them up under the existing if/else chain in params_parse, before
the speculative section. Each new option follows the file's prevailing
patterns (try/catch around numeric parses, the same true/1/yes/on bool
form used elsewhere, hardware_concurrency() fallback for thread counts,
mirror of draft_override_tensor for override_tensor).
Top-level / batching / IO:
- n_ubatch (alias ubatch) -- physical batch size; was previously
force-aliased to n_batch at line 482, blocking embedding/rerank
workloads that need independent control
- threads_batch (alias n_threads_batch) -- main-model batch threads;
mirrors the existing draft_threads_batch
- direct_io (alias use_direct_io) -- O_DIRECT model loads
- verbosity -- llama.cpp log threshold (line 479 had this commented
out)
- override_tensor (alias tensor_buft_overrides) -- per-tensor buffer
overrides for the main model; mirrors draft_override_tensor
Embedding / multimodal:
- pooling_type (alias pooling) -- mean/cls/last/rank/none; previously
only auto-flipped to RANK for rerankers
- embd_normalize (alias embedding_normalize) -- and the embedding
handler now reads params_base.embd_normalize instead of a hardcoded
2 at the previous embd_normalize literal in Embedding()
- mmproj_use_gpu (alias mmproj_offload) -- mmproj on CPU vs GPU
- image_min_tokens / image_max_tokens -- per-image vision token budget
Reasoning surface (the audit-focus three; LocalAI's existing
ReasoningConfig.DisableReasoning only feeds the per-request
chat_template_kwargs.enable_thinking and does not touch any of these):
- reasoning_format -- none/auto/deepseek/deepseek-legacy parser
- enable_reasoning (alias reasoning_budget) -- -1/0/>0 thinking budget
- prefill_assistant -- trailing-assistant-message prefill toggle
All 14 referenced fields exist on both the upstream pin and the
turboquant fork's common.h, so no LOCALAI_LEGACY_LLAMA_CPP_SPEC guard
is needed.
Docs: extend model-configuration.md with new "Reasoning Models",
"Multimodal Backend Options", "Embedding & Reranking Backend Options",
and "Other Backend Tuning Options" subsections; also refresh the
Speculative Type Values table to show the new dash-separated canonical
names alongside the underscore aliases LocalAI still accepts.
Assisted-by: claude-code:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* ⬆️ Update ggml-org/llama.cpp
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix(llama-cpp): adapt to upstream COMMON_SPECULATIVE_TYPE_DRAFT rename
ggml-org/llama.cpp#22964 ("spec: update CLI arguments for better
consistency") renamed the speculative type enum values:
COMMON_SPECULATIVE_TYPE_DRAFT -> COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE
COMMON_SPECULATIVE_TYPE_EAGLE3 -> COMMON_SPECULATIVE_TYPE_DRAFT_EAGLE3
and the registered name strings flipped from underscore- to dash-
separated form (e.g. ngram_simple -> ngram-simple), with the bare
draft/eagle3 aliases replaced by draft-simple/draft-eagle3.
This broke the build with the new LLAMA_VERSION on every variant
(vulkan/arm64, darwin and likely all the rest) at grpc-server.cpp:461.
Update the upstream branch of the speculative-type fallback to use the
new identifier (the LOCALAI_LEGACY_LLAMA_CPP_SPEC fork branch keeps the
old name), and normalize spec_type option tokens before passing them to
common_speculative_types_from_names so existing model configs that say
spec_type:draft / spec_type:ngram_simple keep working.
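The token normalization can be illustrated as below. The real change lives in C++ (grpc-server.cpp); this Go rendering and the function name are hypothetical and only show the mapping the message describes.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeSpecType maps legacy option tokens to the new upstream names:
// underscores become dashes, and the bare draft/eagle3 aliases gain the
// "draft-" prefixed form.
func normalizeSpecType(tok string) string {
	t := strings.ReplaceAll(strings.TrimSpace(tok), "_", "-")
	switch t {
	case "draft":
		return "draft-simple"
	case "eagle3":
		return "draft-eagle3"
	}
	return t
}

func main() {
	for _, tok := range []string{"draft", "ngram_simple", "draft-eagle3"} {
		fmt.Printf("%s -> %s\n", tok, normalizeSpecType(tok))
	}
}
```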
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7
---------
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* ci(image): wire singleton merges + `--` artifact separator
Closes the same singletons gap on the LocalAI server image workflow that
PR #9781 closed for backends. The user observed it as missing
:latest-gpu-nvidia-cuda-12 etc. on quay.io/go-skynet/local-ai — the
build matrix has six single-arch entries with no corresponding merge
step, so their per-arch digests push (push-by-digest=true) and never
get tagged:
- -gpu-hipblas (hipblas-jobs)
- -gpu-nvidia-cuda-12 (core-image-build)
- -gpu-nvidia-cuda-13 (core-image-build)
- -gpu-intel (core-image-build)
- -nvidia-l4t-arm64 (gh-runner)
- -nvidia-l4t-arm64-cuda-13 (gh-runner)
Only :latest, :v<X>, :latest-gpu-vulkan and :v<X>-gpu-vulkan were
actually being published before this commit (the two multiarch suffixes
that had merge jobs).
Changes:
1. image.yml: add six new merge jobs, one per single-arch entry. Each
`needs:` only its parent build job (matching the existing pattern
for core-image-merge / gpu-vulkan-image-merge).
2. image_build.yml: switch artifact name to
`digests-localai<suffix>--<platform-tag-or-"single">`. The `--`
separator anchors the merge-side glob so a singleton tag-suffix
doesn't over-match a longer suffix that shares its prefix
(-nvidia-l4t-arm64 vs -nvidia-l4t-arm64-cuda-13). Same convention
as backend_build.yml's fix.
3. image_merge.yml: update the download pattern to match.
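Why the `--` separator matters can be shown with a small glob check. This uses Go's path.Match purely as an illustration; the artifact download action has its own matcher, and the single-dash "old" names and patterns below are assumed for the comparison.

```go
package main

import (
	"fmt"
	"path"
)

// matches wraps path.Match and treats a malformed pattern as no match.
func matches(pattern, name string) bool {
	ok, err := path.Match(pattern, name)
	return err == nil && ok
}

func main() {
	// Assumed single-dash naming: the shorter suffix's glob also picks
	// up the artifact of the longer suffix that shares its prefix.
	fmt.Println(matches("digests-localai-nvidia-l4t-arm64-*",
		"digests-localai-nvidia-l4t-arm64-cuda-13-single"))

	// With "--" the glob is anchored at the end of the suffix.
	fmt.Println(matches("digests-localai-nvidia-l4t-arm64--*",
		"digests-localai-nvidia-l4t-arm64-cuda-13--single"))
	fmt.Println(matches("digests-localai-nvidia-l4t-arm64--*",
		"digests-localai-nvidia-l4t-arm64--single"))
}
```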
Next master push or tag release should produce :latest-gpu-hipblas,
:latest-gpu-nvidia-cuda-12, :latest-gpu-nvidia-cuda-13, :latest-gpu-intel,
:latest-nvidia-l4t-arm64, :latest-nvidia-l4t-arm64-cuda-13 (and their
:v<X>-* equivalents) for the first time on the post-#9781 workflow.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci(image): add !cancelled() guard to all 8 image merge jobs
Parity pass with backend.yml's merge jobs (8521af14). Without
!cancelled(), GHA's default `needs:` cascade skips the merge when ANY
matrix cell of the parent build job fails or is cancelled — so a single
flaky leg would suppress publication of every other tag-suffix's
manifest list. Same fix the backend got after v4.2.1 showed 2 failed
singlearch builds cascade-skip 199 singlearch merge entries.
Applied to all 8 image merges:
- core-image-merge
- gpu-vulkan-image-merge
- gpu-nvidia-cuda-12-image-merge (added in e5300f1a)
- gpu-nvidia-cuda-13-image-merge (added in e5300f1a)
- gpu-intel-image-merge (added in e5300f1a)
- gpu-hipblas-image-merge (added in e5300f1a)
- nvidia-l4t-arm64-image-merge (added in e5300f1a)
- nvidia-l4t-arm64-cuda-13-image-merge (added in e5300f1a)
Build jobs (hipblas-jobs, core-image-build, gh-runner) are
intentionally NOT changed — they have no upstream `needs:` to cascade-
skip from.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(middleware): parse OpenAI-spec tool_choice in /v1/chat/completions
Follows up on #9526 (the 3-site setter fix) by addressing the remaining
clause in #9508 — string mode and OpenAI-spec specific-function shape both
silently failed in the /v1/chat/completions parsing path.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(middleware): restore LF endings and cover tool_choice parsing with specs
The previous commit on this branch saved core/http/middleware/request.go
with CRLF line endings, ballooning the diff against master to 684 / 651
for what is in reality a ~50-line parsing change. Restore LF (matches
.editorconfig end_of_line = lf).
Add 11 Ginkgo specs under "SetModelAndConfig tool_choice parsing
(chat completions)" that parallel the existing MergeOpenResponsesConfig
specs from #9509. They drive the full middleware chain (SetModelAndConfig
+ SetOpenAIRequest) and assert:
* "required" -> ShouldUseFunctions=true, no specific name
* "none" -> ShouldUseFunctions=false (tools disabled per OpenAI spec)
* "auto" -> default, tools available, no specific name
* {type:function, function:{name:X}} (spec) -> X is forced
* {type:function, name:X} (legacy) -> X is forced
* nested wins when both forms are present
* malformed shapes (no type, wrong type, no name, empty name) are no-ops
Update the inline comment on the string case to describe the actual
mechanism: "none" reaches SetFunctionCallString("none") downstream and
is then honored by ShouldUseFunctions() returning false. Before this PR
json.Unmarshal([]byte("none"), &functions.Tool{}) failed silently, so
"none" was ignored - making "none" actually work is a real behavior fix
this PR brings.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4-7 [Claude Code]
* fix(middleware): preserve pre-#9559 support for JSON-string-encoded tool_choice
Some non-spec clients send tool_choice as a JSON-encoded string of an
object form, e.g. "{\"type\":\"function\",\"function\":{\"name\":\"X\"}}".
The pre-#9559 code accepted this by accident: its case string: branch
ran json.Unmarshal([]byte(content), &functions.Tool{}), which succeeded
for that double-encoded shape even though it failed for the legitimate
plain string modes "auto" / "none" / "required".
The first version of this PR routed every string straight to
SetFunctionCallString as a mode, which fixed the plain-string cases but
silently regressed the double-encoded one (funcs.Select("{...}") returns
nothing). Restore the fallback: when a string looks like a JSON object,
try parsing it as a tool_choice map first; fall through to mode-string
handling only when no usable name comes out.
Factor the map-name extraction into a small helper
(extractToolChoiceFunctionName) so the string-fallback and the regular
map case go through identical code, and accept both the OpenAI-spec
nested shape and the legacy/Anthropic flat shape from either entry
point.
Add 3 Ginkgo specs covering the double-encoded case (nested form, legacy
form, and the fall-through when the JSON has no usable name).
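A rough sketch of the parsing order described above. extractToolChoiceFunctionName is named in the commit, but its signature here, the parseToolChoice wrapper, and the (name, mode) return shape are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// extractToolChoiceFunctionName accepts the nested OpenAI-spec shape and
// the legacy flat shape; nested wins, malformed input yields "".
func extractToolChoiceFunctionName(m map[string]any) string {
	if t, _ := m["type"].(string); t != "function" {
		return ""
	}
	if fn, ok := m["function"].(map[string]any); ok {
		if name, _ := fn["name"].(string); name != "" {
			return name
		}
	}
	name, _ := m["name"].(string)
	return name
}

// parseToolChoice returns a forced function name or a mode string.
func parseToolChoice(v any) (name, mode string) {
	switch tc := v.(type) {
	case map[string]any:
		return extractToolChoiceFunctionName(tc), ""
	case string:
		// Double-encoded object: try it as a map first.
		if strings.HasPrefix(strings.TrimSpace(tc), "{") {
			var m map[string]any
			if err := json.Unmarshal([]byte(tc), &m); err == nil {
				if n := extractToolChoiceFunctionName(m); n != "" {
					return n, ""
				}
			}
		}
		// Otherwise treat the string as a mode ("auto"/"none"/"required").
		return "", tc
	}
	return "", ""
}

func main() {
	fmt.Println(parseToolChoice("none"))
	fmt.Println(parseToolChoice(`{"type":"function","function":{"name":"X"}}`))
}
```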
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4-7 [Claude Code]
* test(middleware): silence errcheck on AfterEach os.RemoveAll
The new tool_choice parsing tests added a second AfterEach that calls
os.RemoveAll(modelDir) without checking the error; errcheck flagged it.
Suppress with the standard _ = idiom. The pre-existing AfterEach on the
earlier Describe still elides the check the same way it did before -
leaving that untouched to keep this commit minimal.
Assisted-by: Claude:opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(agentpool): close truncate-then-read race in agent_jobs.json persistence
Three call sites wrote and read agent_jobs.json (and agent_tasks.json)
through three independent mutexes:
- AgentJobService.ExecuteJob spawns go saveJobs(job) -> fileJobPersister
holding p.mu
- AgentJobService.SaveJobsToFile holding service.fileMutex
- AgentJobService.LoadJobsFromFile on a separate service instance holding
a different service.fileMutex
Nothing serialized those mutexes, and both writers used os.WriteFile, which
opens O_TRUNC. A reader landing between the truncate and the write saw a
zero-byte file and surfaced as `unexpected end of JSON input` at offset 0.
The macOS tests-apple job started hitting this consistently once the path
filter was removed from .github/workflows/test.yml and the file-mode race
test ran on every push (run 25823124797 was the first observed failure).
Two changes close the window:
1. fileJobPersister.saveTasksToFile / saveJobsToFile now write to a
same-directory temp file and os.Rename to the final path. rename(2) is
atomic on POSIX, so concurrent readers see either the prior contents or
the new contents and never a zero-byte window. The helper Syncs before
close so a crash mid-write leaves either the old file intact or the temp
behind (cleaned up on next save).
2. AgentJobService.{Load,Save}{Tasks,Jobs}{FromFile,ToFile} are collapsed
to thin wrappers around fileJobPersister, removing the duplicate write
path and the redundant service.fileMutex / service.tasksFile /
service.jobsFile fields. Within a single service all task/job I/O now
serializes on the persister's mutex; the atomic rename handles the
cross-instance case the tests exercise.
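The temp-file-plus-rename write from change 1 can be sketched like this. writeFileAtomic and roundTrip are hypothetical names; the actual helper inside fileJobPersister may differ in details such as permissions and temp-file naming.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temp file in the same directory,
// syncs it, then renames it over path so readers never see a truncated file.
func writeFileAtomic(path string, data []byte, perm os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	// Cleans up on failure; after a successful rename this is a harmless no-op.
	defer func() { _ = os.Remove(tmpName) }()

	if _, err := tmp.Write(data); err != nil {
		_ = tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		_ = tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if err := os.Chmod(tmpName, perm); err != nil {
		return err
	}
	return os.Rename(tmpName, path)
}

// roundTrip writes two payloads in sequence and returns the final contents.
func roundTrip(first, second string) (string, error) {
	dir, err := os.MkdirTemp("", "atomic-demo")
	if err != nil {
		return "", err
	}
	defer func() { _ = os.RemoveAll(dir) }()
	path := filepath.Join(dir, "agent_jobs.json")
	for _, payload := range []string{first, second} {
		if err := writeFileAtomic(path, []byte(payload), 0o644); err != nil {
			return "", err
		}
	}
	b, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	got, err := roundTrip(`[]`, `[{"id":"job-1"}]`)
	fmt.Println(got, err)
}
```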
Adds a regression test that hammers SaveJobsToFile and LoadJobsFromFile
concurrently for 500ms across two service instances on the same paths.
On master this reproduces `unexpected end of JSON input` on Linux within
~500ms; with the fix the suite ran -until-it-fails for 30s (54 attempts,
all green).
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* refactor(agentpool): route service flush/load through JobPersister interface
The first cut of the race fix made AgentJobService.{Save,Load}{Tasks,Jobs}*
type-assert s.persister to *fileJobPersister so they could reach the
unexported saveTasksToFile / saveJobsToFile helpers. That defeats the
JobPersister interface: the service is back to reasoning about a concrete
implementation instead of an abstraction.
Promote the bulk-flush operations to the interface as FlushTasks / FlushJobs:
- fileJobPersister.FlushTasks/FlushJobs call the existing private helpers
(atomic temp+rename writes from the prior commit).
- dbJobPersister.FlushTasks/FlushJobs are no-ops because SaveTask/SaveJob
are already write-through to the database.
The service's four file-named methods now talk only to the interface:
LoadTasks/LoadJobs read through s.persister.LoadTasks/LoadJobs, and the
Save side calls FlushTasks/FlushJobs. The "FromFile"/"ToFile" suffixes
stay for backward compat with user_services.go and the existing tests,
but they no longer claim a file-only contract.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Mirror of 8521af14 (which fixed backend_merge.yml) for image_merge.yml.
Today's master-push run 25823024353 failed the gpu-vulkan-image-merge job
with the exact same error pattern the backend merge had on v4.2.2:
ERROR: quay.io/go-skynet/local-ai@sha256:68b22611...: not found
Same root cause: image_build.yml pushes the per-arch manifest to
quay.io/go-skynet/local-ai with push-by-digest=true (no tag), then the
merge runs minutes-to-hours later, by which time quay's per-repo manifest
GC has reaped the untagged digest from local-ai. The blob still lives in
quay's storage but local-ai@<digest> no longer resolves.
Three matching edits:
1. image_build.yml: anchor each per-arch digest into ci-cache immediately
after the push, reusing .github/scripts/anchor-digest-in-cache.sh with
SOURCE_IMAGE=quay.io/go-skynet/local-ai and TAG_SUFFIX defaulting to
"-core" for the core image (matches the artifact-name convention).
2. image_merge.yml: change the quay merge source from local-ai@<digest>
to ci-cache@<digest>. Same correctness argument as backend_merge.yml —
the manifest content is alive in ci-cache; buildx imagetools create
republishes it into local-ai and writes the user-facing manifest list
pointing at it. End state in local-ai is self-contained.
3. image_merge.yml: add a sparse `actions/checkout@v6` (only
.github/scripts) so cleanup-keepalive-tags.sh is available, plus the
cleanup step itself with TAG_SUFFIX matching the anchor's "-core"
placeholder.
v4.2.3's image.yml run completed successfully (~50 min between push and
merge — beat quay's GC). This commit closes the race for future releases
and master pushes regardless of run length.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(http): honor X-Forwarded-Prefix when proxy strips the prefix
Closes #9145.
Two related issues kept the React UI from loading when a reverse proxy
rewrites a sub-path with prefix-stripping (e.g. Caddy `handle_path`):
1. `BaseURL` only computed a prefix from the path StripPathPrefix had
removed, so when the proxy strips the prefix before forwarding, the
request arrives without it and the base URL was returned without a
prefix. Extract a `BasePathPrefix` helper and add an
`X-Forwarded-Prefix` header fallback so the prefix is recovered.
2. `<base href>` only changes how relative URLs resolve; the build
emits path-absolute references like `/assets/...` and
`/favicon.svg`, which still resolve against the origin and bypass
the proxy prefix. Rewrite those references in the served
`index.html` so the browser requests them through the proxy.
Adds unit coverage for `BaseURL` with a pre-stripped path and an
end-to-end test for the proxy-stripped scenario.
Assisted-by: Claude:claude-opus-4-7
* fix(http): gate X-Forwarded-Prefix through SafeForwardedPrefix in BasePathPrefix
BasePathPrefix consumed X-Forwarded-Prefix directly, so a value the
codebase elsewhere rejects (e.g. "//evil.com") slipped through and was
interpolated into the SPA index.html — both into the path-absolute asset
URL rewrite in serveIndex (turning "/assets/..." into "//evil.com/assets/...",
a protocol-relative URL that loads JS from a foreign origin) and into
<base href>. Route the header through the existing SafeForwardedPrefix
validator that StripPathPrefix and prefixRedirect already use, and
HTML-escape the prefix before injecting it into the asset rewrite as
defense in depth against attribute breakout.
Tests cover //evil.com, backslashes, control chars, CR/LF and a missing
leading slash; the integration test asserts an unsafe prefix can't poison
asset URLs.
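A sketch of the validate-then-escape flow. safePrefix is a hypothetical stand-in for SafeForwardedPrefix (whose exact rules are not spelled out here), and rewriteAssets only approximates the serveIndex rewrite with a plain string replace.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// safePrefix rejects values that are empty, lack a leading slash, start
// with "//", or contain backslashes or control characters (assumed rules).
func safePrefix(p string) (string, bool) {
	if p == "" || !strings.HasPrefix(p, "/") || strings.HasPrefix(p, "//") {
		return "", false
	}
	for _, r := range p {
		if r == '\\' || r < 0x20 || r == 0x7f {
			return "", false
		}
	}
	return strings.TrimRight(p, "/"), true
}

// rewriteAssets prefixes path-absolute asset URLs; an unsafe or empty
// prefix leaves the HTML untouched.
func rewriteAssets(indexHTML, prefix string) string {
	safe, ok := safePrefix(prefix)
	if !ok || safe == "" {
		return indexHTML
	}
	esc := html.EscapeString(safe) // defense in depth against attribute breakout
	return strings.ReplaceAll(indexHTML, `="/assets/`, `="`+esc+`/assets/`)
}

func main() {
	page := `<script src="/assets/app.js"></script>`
	fmt.Println(rewriteAssets(page, "/localai"))
	fmt.Println(rewriteAssets(page, "//evil.com"))
}
```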
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7-1m [Read] [Edit] [Bash]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix(distributed): cascade-clean stale node_models on drain and filter routing by healthy status
Stale node_models rows (state="loaded") were surviving past the healthy
state of their owning node, causing /embeddings (and other inference
paths) to dispatch to a backend whose process was gone or drained. The
downstream symptom in a live cluster was pgvector rejecting inserts
with "vector cannot have more than 16000 dimensions (SQLSTATE 54000)"
because the misbehaving backend silently returned a malformed
(oversized) tensor; the Models page showed the model as "running"
without an associated node, like a stale entry, even though the node
was no longer visible in the Nodes view.
Two changes here, plus a third in a follow-up commit:
- MarkDraining now cascade-deletes node_models rows for the affected
node, mirroring MarkOffline. Drains are explicit operator actions —
the box has been intentionally taken out of rotation — so clearing
the rows stops the Models UI from misreporting and prevents the
routing layer from picking those rows if scheduling logic is ever
relaxed. In-flight requests already hold their gRPC client through
Route() and finish normally; the only observable effect is a
non-fatal IncrementInFlight warning, acceptable for a drain.
MarkUnhealthy is deliberately left status-only: it fires from
managers_distributed / reconciler on a single nats.ErrNoResponders
with no retry, so a transient NATS hiccup must not nuke every loaded
model and force a full reload on recovery.
- FindAndLockNodeWithModel's inner JOIN now filters on
backend_nodes.status = healthy in addition to node_models.state =
loaded. The previous version relied on the second node-fetch step to
reject non-healthy nodes, but a concurrent reader could still pick
the same stale row in the same window. Belt-and-braces.
- DistributedConfig.PerModelHealthCheck renamed to
DisablePerModelHealthCheck and inverted at the call site so
per-model gRPC probing is on by default. The probe (now made
consecutive-miss aware in a follow-up commit) independently health-
checks each model's gRPC address and removes stale node_models rows
when the backend has crashed even though the worker's node-level
heartbeat is still arriving.
Migration: the field had no CLI flag, env var binding, or YAML key
in tree (only the bare struct field), so there is no user-facing
migration. Anything constructing DistributedConfig in code needs to
drop the assignment (default now does the right thing) or invert it.
Assisted-by: Claude:claude-opus-4-7 go-vet go-test golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(distributed): require consecutive misses before per-model probe removes a row
The per-model gRPC probe used to remove a node_models row on a single
failed health check. With the per-model probe now on by default, that
made any 5-second gRPC blip (network jitter, a long-running request
hogging the worker's gRPC server thread, brief GC pause) trigger a
full reload of the affected model — too eager for production.
Require perModelMissThreshold (3) consecutive failed probes before
removal. At the default 15s tick a model must be unreachable for ~45s
before reap; a single successful probe in between resets the streak.
Per-(node, model, replica) state tracked under a mutex on the monitor.
If the removal call itself fails, the miss counter is left in place
so the next tick retries rather than starting the streak over.
Tests:
- removes stale model via per-model health check after consecutive
failures (replaces the single-shot expectation)
- preserves model row when an intermittent failure is followed by a
success (covers the reset-on-success path and verifies the counter
reset by failing twice more without crossing threshold)
- newTestHealthMonitor initializes the misses map so direct-construct
test helpers don't nil-map-panic in the probe path
Assisted-by: Claude:claude-opus-4-7 go-vet go-test golangci-lint
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* feat(liquid-audio): add LFM2.5-Audio any-to-any backend + realtime_audio usecase
Wires LiquidAI's LFM2.5-Audio-1.5B as a self-contained Realtime API model:
single engine handles VAD, transcription, LLM, and TTS in one bidirectional
stream — drop-in alternative to a VAD+STT+LLM+TTS pipeline.
Backend
- backend/python/liquid-audio/ — new Python gRPC backend wrapping the
`liquid-audio` package. Modes: chat / asr / tts / s2s, voice presets,
Load/Predict/PredictStream/AudioTranscription/TTS/VAD/AudioToAudioStream/
Free and StartFineTune/FineTuneProgress/StopFineTune. Runtime monkey-patch
on `liquid_audio.utils.snapshot_download` so absolute local paths from
LocalAI's gallery resolve without a HF round-trip. soundfile in place of
torchaudio.load/save (torchcodec drags NVIDIA NPP we don't bundle).
- backend/backend.proto + pkg/grpc/{backend,client,server,base,embed,
interface}.go — new AudioToAudioStream RPC mirroring AudioTransformStream
(config/frame/control oneof in; typed event+pcm+meta out).
- core/services/nodes/{health_mock,inflight}_test.go — add stubs for the
new RPC to the test fakes.
Config + capabilities
- core/config/backend_capabilities.go — UsecaseRealtimeAudio, MethodAudio
ToAudioStream, UsecaseInfoMap entry, liquid-audio BackendCapability row.
- core/config/model_config.go — FLAG_REALTIME_AUDIO bitmask, ModalityGroups
membership in both speech-input and audio-output groups so a lone flag
still reads as multimodal, GetAllModelConfigUsecases entry, GuessUsecases
branch.
Realtime endpoint
- core/http/endpoints/openai/realtime.go — extract prepareRealtimeConfig()
so the gate is unit-testable; accept realtime_audio models and self-fill
empty pipeline slots with the model's own name (user-pinned slots win).
- core/http/endpoints/openai/realtime_gate_test.go — six specs covering nil
cfg, empty pipeline, legacy pipeline, self-contained realtime_audio,
user-pinned VAD slot, and partial legacy pipeline.
UI + endpoints
- core/http/routes/ui.go — /api/pipeline-models accepts either a legacy
VAD+STT+LLM+TTS pipeline or a realtime_audio model; surfaces a
self_contained flag so the Talk page can collapse the four cards.
- core/http/routes/ui_api.go — realtime_audio in usecaseFilters.
- core/http/routes/ui_pipeline_models_test.go — covers both code paths.
- core/http/react-ui/src/pages/Talk.jsx — self-contained badge instead of
the four-slot grid; rename Edit Pipeline → Edit Model Config; less
pipeline-specific wording.
- core/http/react-ui/src/pages/Models.jsx + locales/en/models.json — new
realtime_audio filter button + i18n.
- core/http/react-ui/src/utils/capabilities.js — CAP_REALTIME_AUDIO.
- core/http/react-ui/src/pages/FineTune.jsx — voice + validation-dataset
fields, surfaced when backend === liquid-audio, plumbed via
extra_options on submit/export/import.
Gallery + importer
- gallery/liquid-audio.yaml — config template with known_usecases:
[realtime_audio, chat, tts, transcript, vad].
- gallery/index.yaml — four model entries (realtime/chat/asr/tts) keyed by
mode option. Fixed pre-existing `transcribe` typo on the asr entry
(loader silently dropped the unknown string → entry never surfaced as a
transcript model).
- gallery/lfm.yaml — function block for the LFM2 Pythonic tool-call format
`<|tool_call_start|>[name(k="v")]<|tool_call_end|>` matching
common_chat_params_init_lfm2 in vendored llama.cpp.
- core/gallery/importers/{liquid-audio,liquid-audio_test}.go — detector
matches LFM2-Audio HF repos (excludes -gguf mirrors); mode/voice
preferences plumbed through to options.
- core/gallery/importers/importers.go — register LiquidAudioImporter
before LlamaCPPImporter.
- pkg/functions/parse_lfm2_test.go — seven specs for the response/argument
regex pair on the LFM2 pythonic format.
Build matrix
- .github/backend-matrix.yml — seven liquid-audio targets (cuda12, cuda13,
l4t-cuda-13, hipblas, intel, cpu amd64, cpu arm64). Jetpack r36 cuda-12
is skipped (Ubuntu 22.04 / Python 3.10 incompatible with liquid-audio's
3.12 floor).
- backend/index.yaml — anchor + 13 image entries.
- Makefile — .NOTPARALLEL, prepare-test-extra, test-extra,
docker-build-liquid-audio.
Docs
- .agents/plans/liquid-audio-integration.md — phased plan; PR-D (real
any-to-any wiring via AudioToAudioStream), PR-E (mid-audio tool-call
detector), PR-G (GGUF entries once upstream llama.cpp PR #18641 lands)
remain.
- .agents/api-endpoints-and-auth.md — expand the capability-surface
checklist with every place a new FLAG_* needs to be registered.
Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(realtime): function calling + history cap for any-to-any models
Three pieces, all on the realtime_audio path that just landed:
1. liquid-audio backend (backend/python/liquid-audio/backend.py):
- _build_chat_state grows a `tools_prelude` arg.
- new _render_tools_prelude parses request.Tools (the OpenAI Chat
Completions function array realtime.go already serialises) and
emits an LFM2 `<|tool_list_start|>…<|tool_list_end|>` system turn
ahead of the user history. Mirrors gallery/lfm.yaml's `function:`
template so the model sees the same prompt shape whether served
via llama-cpp or here. Without this the backend silently dropped
tools — function calling was wired end-to-end on the Go side but
the model never saw a tool list.
2. Realtime history cap (core/http/endpoints/openai/realtime.go):
- Session grows MaxHistoryItems int; default picked by new
defaultMaxHistoryItems(cfg) — 6 for realtime_audio models (LFM2.5
1.5B degrades quickly past a handful of turns), 0/unlimited for
legacy pipelines composing larger LLMs.
- triggerResponse runs conv.Items through trimRealtimeItems before
building conversationHistory. Helper walks the cut left if it
would orphan a function_call_output, so tool result + call pairs
stay intact.
- realtime_gate_test.go: specs for defaultMaxHistoryItems and
trimRealtimeItems (zero cap, under cap, over cap, tool-call pair
preservation).
3. Talk page (core/http/react-ui/src/pages/Talk.jsx):
- Reuses the chat page's MCP plumbing — useMCPClient hook,
ClientMCPDropdown component, same auto-connect/disconnect effect
pattern. No bespoke tool registry, no new REST endpoints; tools
come from whichever MCP servers the user toggles on, exactly as
on the chat page.
- sendSessionUpdate now passes session.tools=getToolsForLLM(); the
update re-fires when the active server set changes mid-session.
- New response.function_call_arguments.done handler executes via
the hook's executeTool (which round-trips through the MCP client
SDK), then replies with conversation.item.create
{type:function_call_output} + response.create so the model
completes its turn with the tool output. Mirrors chat's
client-side agentic loop, translated to the realtime wire shape.
UI changes require a LocalAI image rebuild (Dockerfile:308-313 bakes
react-ui/dist into the runtime image). Backend.py changes can be
swapped live in /backends/<id>/backend.py + /backend/shutdown.
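The history cap from item 2 can be sketched roughly as follows. This is a minimal illustration, not the code in realtime.go: the `Item` type, its `Type` strings and the `trimItems` name are assumptions made for the example. Only the behaviour comes from the description above (a cap of 0 means unlimited, the newest items are kept, and the cut walks left rather than orphaning a function_call_output).

```go
package main

import "fmt"

// Item is a hypothetical stand-in for a realtime conversation item.
type Item struct {
	Type string // "message", "function_call" or "function_call_output"
	ID   string
}

// trimItems keeps roughly the newest maxItems entries. A cap of 0 (or less)
// means unlimited. If the cut would start on a function_call_output, the cut
// moves left so the function_call that produced it is kept as well.
func trimItems(items []Item, maxItems int) []Item {
	if maxItems <= 0 || len(items) <= maxItems {
		return items
	}
	cut := len(items) - maxItems
	for cut > 0 && items[cut].Type == "function_call_output" {
		cut--
	}
	return items[cut:]
}

func main() {
	history := []Item{
		{Type: "message", ID: "m1"},
		{Type: "function_call", ID: "c1"},
		{Type: "function_call_output", ID: "o1"},
		{Type: "message", ID: "m2"},
	}
	// A cap of 2 would start the window at o1, so the cut moves left to c1.
	for _, it := range trimItems(history, 2) {
		fmt.Println(it.Type, it.ID)
	}
}
```

In this sketch the result can exceed the cap by the length of the preserved pair; whether the real helper makes the same trade-off is not stated in the commit message.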
Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(realtime): LocalAI Assistant ("Manage Mode") for the Talk page
Mirrors the chat-page metadata.localai_assistant flow so users can ask the
realtime model what's loaded / installed / configured. Tools are run
server-side via the same in-process MCP holder that powers the chat
modality — no transport switch, no proxy, no new wire protocol.
Wire:
- core/http/endpoints/openai/realtime.go:
- RealtimeSessionOptions{LocalAIAssistant,IsAdmin}; isCurrentUserAdmin
helper mirrors chat.go's requireAssistantAccess (no-op when auth
disabled, else requires auth.RoleAdmin).
- Session grows AssistantExecutor mcpTools.ToolExecutor.
- runRealtimeSession, when opts.LocalAIAssistant is set: gate on admin,
fail closed if DisableLocalAIAssistant or the holder has no tools,
DiscoverTools and inject into session.Tools, prepend
holder.SystemPrompt() to instructions.
- Tool-call dispatch loop: when AssistantExecutor.IsTool(name), run
ExecuteTool inproc, append a FunctionCallOutput to conv.Items, skip
the function_call_arguments client emit (the client can't execute
these — it doesn't know about them). After the loop, if any
assistant tool ran, trigger another response so the model speaks the
result. Mirrors chat's agentic loop, driven server-side rather than
via client round-trip.
- core/http/endpoints/openai/realtime_webrtc.go: RealtimeCallRequest
gains `localai_assistant` (JSON omitempty). Handshake calls
isCurrentUserAdmin and builds RealtimeSessionOptions.
- core/http/react-ui/src/pages/Talk.jsx: admin-only "Manage Mode"
checkbox under the Tools dropdown; passes localai_assistant: true to
realtimeApi.call's body, captured in the connect callback's deps.
Mirroring chat's pattern means the in-process MCP tools surface "just
works" for the Talk page without exposing a Streamable-HTTP MCP endpoint
(which was the alternative). Clients with their own MCP servers can
still use the existing ClientMCPDropdown path in parallel; the realtime
handler distinguishes them by AssistantExecutor.IsTool() at dispatch
time.
Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(realtime): render Manage Mode tool calls in the Talk transcript
Previously the realtime endpoint only emitted response.output_item.added
for the FunctionCall item, and Talk.jsx's switch ignored the event — so
server-side tool runs were invisible in the UI. The model would speak
the result but the user had no way to see what tool was actually
called.
realtime.go: after executing an assistant tool inproc, emit a second
output_item.added/.done pair for the FunctionCallOutput item. Mirrors
the way the chat page displays tool_call + tool_result blocks.
Talk.jsx: handle both response.output_item.added and .done. Render
FunctionCall (with arguments) and FunctionCallOutput (pretty-printed
JSON when possible) as two transcript entries — `tool_call` with the
wrench icon, `tool_result` with the clipboard icon, both in mono-space
secondary-colour. Resets streamingRef after the result so the next
assistant text delta starts a fresh transcript entry instead of
appending to the previous turn.
Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* refactor(realtime): bound the Manage Mode tool-loop + preserve assistant tools
Fallout from a review pass on the Manage Mode patches:
- Bound the server-side agentic loop. triggerResponse used to recurse on
executedAssistantTool with no cap — a model that kept calling tools
would blow the goroutine stack. New maxAssistantToolTurns = 10 (mirrors
useChat.js's maxToolTurns). Public triggerResponse is now a thin shim
over triggerResponseAtTurn(toolTurn int); recursion increments the
counter and stops at the cap with an xlog.Warn.
- Preserve Manage Mode tools across client session.update. The handler
used to blindly overwrite session.Tools, so toggling a client MCP
server mid-session silently wiped the in-process admin tools. Session
now caches the original AssistantTools slice at session creation and
the session.update handler merges them back in (client names win on
collision — the client is explicit).
- strconv.ParseBool for the localai_assistant query param instead of
hand-rolled "1" || "true". Mirrors LocalAIAssistantFromMetadata.
- Talk.jsx: render both tool_call and tool_result on
response.output_item.done instead of splitting them across .added and
.done. The server's event pairing (added → done) stays correct; the
UI just doesn't need to inspect both phases of the same item. One
switch case instead of two, no behavioural change.
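The bounded loop from the first bullet might look roughly like the sketch below. The `runTurn` callback and the returned turn count are assumptions added for illustration; the names `triggerResponse`, `triggerResponseAtTurn` and `maxAssistantToolTurns = 10` come from the text above, but their real signatures are not shown there.

```go
package main

import "fmt"

// maxAssistantToolTurns is the cap named in the commit message.
const maxAssistantToolTurns = 10

// runTurn is a hypothetical callback: it performs one model response and
// reports whether a server-side assistant tool was executed during it.
type runTurn func(turn int) (executedTool bool)

// triggerResponse is the thin public shim; it starts the counter at zero.
func triggerResponse(run runTurn) int {
	return triggerResponseAtTurn(run, 0)
}

// triggerResponseAtTurn triggers another response while tools keep running
// and stops once the cap is reached. It returns the number of turns executed.
func triggerResponseAtTurn(run runTurn, toolTurn int) int {
	if toolTurn >= maxAssistantToolTurns {
		fmt.Println("warning: assistant tool-turn cap reached")
		return toolTurn
	}
	if !run(toolTurn) {
		return toolTurn + 1 // the model answered without calling a tool
	}
	return triggerResponseAtTurn(run, toolTurn+1)
}

func main() {
	// A model that calls a tool on every turn is stopped at the cap.
	alwaysCallsTool := func(int) bool { return true }
	fmt.Println("turns:", triggerResponse(alwaysCallsTool))
}
```

Whether the real implementation counts the capped turn itself is an assumption here; the point is only that the recursion depth is bounded by a constant rather than by model behaviour.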
Out of scope (noted for follow-ups): extract a shared assistant-tools
helper between chat.go and realtime.go (duplication is small enough
that two parallel implementations stay readable for now), and an i18n
key for the Manage Mode helper text (Talk.jsx doesn't use i18n
anywhere else yet).
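The "client names win on collision" merge from the second bullet can be illustrated with a short sketch. The `Tool` type, its fields and the `mergeTools` name are hypothetical; the commit message only states that the cached assistant tools are merged back in after a client session.update.

```go
package main

import "fmt"

// Tool is a hypothetical, reduced tool descriptor. Source exists only to make
// the example output readable.
type Tool struct {
	Name   string
	Source string
}

// mergeTools returns the client's tools followed by every cached assistant
// tool whose name the client did not also define. Client entries win on a
// name collision.
func mergeTools(client, assistant []Tool) []Tool {
	seen := make(map[string]bool, len(client))
	merged := make([]Tool, 0, len(client)+len(assistant))
	for _, t := range client {
		seen[t.Name] = true
		merged = append(merged, t)
	}
	for _, t := range assistant {
		if !seen[t.Name] {
			merged = append(merged, t)
		}
	}
	return merged
}

func main() {
	client := []Tool{{Name: "search", Source: "client"}}
	assistant := []Tool{
		{Name: "search", Source: "assistant"},
		{Name: "list_models", Source: "assistant"}, // hypothetical tool name
	}
	for _, t := range mergeTools(client, assistant) {
		fmt.Println(t.Name, t.Source)
	}
}
```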
Assisted-by: claude-code:claude-opus-4-7-1m [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* ci(test-extra): wire liquid-audio backend smoke test
The backend ships test.py + a `make test` target and is listed in
backend-matrix.yml, so scripts/changed-backends.js already writes a
`liquid-audio=true|false` output when files under backend/python/liquid-audio/
change. The workflow just wasn't reading it.
- Expose the `liquid-audio` output on the detect-changes job
- Add a tests-liquid-audio job that runs `make` + `make test` in
backend/python/liquid-audio, gated on the per-backend detect flag
The smoke test covers Health() and LoadModel(mode:finetune); fine-tune mode
short-circuits before any HuggingFace download (backend.py:192), so the
job needs neither weights nor a GPU. The full-inference path remains
gated on LIQUID_AUDIO_MODEL_ID, which CI doesn't set.
The four new Go test files (core/gallery/importers/liquid-audio_test.go,
core/http/endpoints/openai/realtime_gate_test.go,
core/http/routes/ui_pipeline_models_test.go, pkg/functions/parse_lfm2_test.go)
are already picked up by the existing test.yml workflow via `make test` →
`ginkgo -r ./pkg/... ./core/...`; their packages all carry RunSpecs entries.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Follow-up to PR #9781. v4.2.2 (run 25745181433) showed the keepalive
anchor in ci-cache wasn't enough on its own: 19 of 37 multiarch merges
still failed with "manifest not found" for the same digests we'd just
anchored.
Quay's manifest GC is per-repository. The anchor tag in ci-cache
protects the manifest copy that lives in ci-cache, but the same digest
in local-ai-backends is independently tracked and gets reaped because
nothing in local-ai-backends references it (push-by-digest=true leaves
it untagged). The merge then asks for
`local-ai-backends@sha256:<digest>` and quay correctly says "not found"
in that repo, even though `ci-cache@sha256:<digest>` is alive and well.
Empirical confirmation against a live failed digest from v4.2.2:
$ docker buildx imagetools inspect quay.io/go-skynet/ci-cache@sha256:05377fe6...
Name: quay.io/go-skynet/ci-cache@sha256:05377fe6...
MediaType: application/vnd.docker.distribution.manifest.v2+json
$ docker buildx imagetools inspect quay.io/go-skynet/local-ai-backends@sha256:05377fe6...
ERROR: ... not found
Switch the source of the quay merge step to ci-cache. The blobs the
manifest references are already accessible from local-ai-backends
(verified via direct registry HEAD: HTTP 200 from both repos — the
original push cross-mounted blobs at content-addressable storage time
and they outlive the per-repo manifest GC). buildx imagetools create
republishes the manifest into local-ai-backends, then writes the
user-facing manifest list pointing at it. End state is self-contained:
the published manifest list references child manifests by digest only,
no embedded reference to ci-cache.
Dockerhub merge step is unchanged. Dockerhub's GC isn't aggressive
enough to reap untagged manifests at the timescales we operate on
(verified: localai/localai-backends@<same digest> still resolves cleanly
after >24h).
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* chore(llama.cpp): bump to 1ec7ba0c14f33f17e980daeeda5f35b225d41994
Picks up the upstream `spec : parallel drafting support` change
(ggml-org/llama.cpp#22838) which reshapes the speculative-decoding API
and `server_context_impl`.
Adapt the grpc-server wrapper accordingly:
* `common_params_speculative::type` (single enum) became `types`
(`std::vector<common_speculative_type>`). Update both the
"default to draft when a draft model is set" branch and the
`spec_type`/`speculative_type` option parser. The parser now also
tolerates comma-separated lists, mirroring the upstream
`common_speculative_types_from_names` semantics.
* `common_params_speculative_draft::n_ctx` is gone (draft now shares
the target context size). Keep the `draft_ctx_size` option name for
backward compatibility and ignore the value rather than failing.
* `server_context_impl::model` was renamed to `model_tgt`; update the
two reranker / model-metadata call sites.
Replaces #9763. Builds cleanly under the linux/amd64 cpu-llama-cpp
target locally.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama-cpp): expose new speculative-decoding option keys
Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838)
adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative
families and beefs up the draft-model knobs. The previous bump only
adapted the API; this exposes the new fields through the grpc-server
options dictionary so model configs can drive them.
New `options:` keys (all under `backend: llama-cpp`):
ngram_mod (`ngram_mod` type):
spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match
ngram_map_k (`ngram_map_k` type):
spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits
ngram_map_k4v (`ngram_map_k4v` type):
spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m /
spec_ngram_map_k4v_min_hits
ngram lookup caches (`ngram_cache` type):
spec_lookup_cache_static / lookup_cache_static
spec_lookup_cache_dynamic / lookup_cache_dynamic
Draft-model tuning (active when `spec_type` is `draft`):
draft_cache_type_k / spec_draft_cache_type_k
draft_cache_type_v / spec_draft_cache_type_v
draft_threads / spec_draft_threads
draft_threads_batch / spec_draft_threads_batch
draft_cpu_moe / spec_draft_cpu_moe (bool flag)
draft_n_cpu_moe / spec_draft_n_cpu_moe (first N MoE layers on CPU)
draft_override_tensor / spec_draft_override_tensor
(comma-separated <tensor regex>=<buffer type>; re-implements upstream's
static parse_tensor_buffer_overrides since it isn't exported)
`spec_type` already accepted comma-separated lists after the previous
commit, matching upstream's `common_speculative_types_from_names`.
Docs: refresh `docs/content/advanced/model-configuration.md` with
per-family tables and a note about multi-type chaining.
Builds locally with `make docker-build-llama-cpp` (linux/amd64
cpu-llama-cpp AVX variant).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(turboquant): bridge new llama.cpp spec API to the legacy fork layout
The previous commits in this series adapted backend/cpp/llama-cpp/grpc-server.cpp
to the post-#22838 (parallel drafting) llama.cpp API. The turboquant build
reuses the same grpc-server.cpp through backend/cpp/turboquant/Makefile,
which copies it into turboquant-<flavor>-build/ and runs patch-grpc-server.sh
on the copy. The fork branched before the API refactor, so it errors out on:
* `ctx_server.impl->model_tgt` (fork still has `model`)
* `params.speculative.{ngram_mod,ngram_map_k,ngram_map_k4v,ngram_cache}.*`
(none of these sub-structs exist in the fork)
* `params.speculative.draft.{cache_type_k/v, cpuparams[, _batch].n_threads,
tensor_buft_overrides}` (fork uses the pre-#22397 flat layout)
* `params.speculative.types` vector / `common_speculative_types_from_names`
(fork has a scalar `type` and only the singular helper)
Approach:
1. backend/cpp/llama-cpp/grpc-server.cpp: introduce a single feature switch
`LOCALAI_LEGACY_LLAMA_CPP_SPEC`. When defined, the two `speculative.type[s]`
discriminations (the "default to draft when a draft model is set" branch
and the `spec_type` / `speculative_type` option parser) fall back to the
singular scalar form, and the entire new-option block (ngram_mod / map_k
/ map_k4v / ngram_cache / draft.{cache_type_*, cpuparams*,
tensor_buft_overrides}) is preprocessed out. The macro is *not* defined
in the source tree — stock llama-cpp builds get the full new API.
2. backend/cpp/turboquant/patch-grpc-server.sh: two new patch steps applied
to the per-flavor build copy at turboquant-<flavor>-build/grpc-server.cpp:
- substitute `ctx_server.impl->model_tgt` -> `ctx_server.impl->model`
- inject `#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1` before the first
`#include`, so the guarded blocks above drop out for the fork build.
Both patches are idempotent and follow the existing sed/awk pattern in
this script (KV cache types, `get_media_marker`, flat speculative
renames). Stock llama-cpp's `grpc-server.cpp` is never touched.
Drop both legacy patches once the turboquant fork rebases past
ggml-org/llama.cpp#22397 / #22838.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(turboquant): close draft_ctx_size brace inside legacy guard
The previous turboquant fix wrapped the new option-handler blocks in
`#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC ... #endif` but placed the guard
in the middle of an `else if` chain — the `} else if` openings of the
new blocks were responsible for closing the previous block's brace.
With the macro defined the new blocks vanish, draft_ctx_size's `{`
loses its closer, the for-loop's `}` is consumed instead, and the
file ends with a stray opening brace — clang reports it as
`function-definition is not allowed here before '{'` on the next
top-level `int main(...)` and `expected '}' at end of input`.
Move the chain split inside the draft_ctx_size branch:
} else if (... "draft_ctx_size") {
// ...
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
} // legacy: chain ends here
#else
} else if (... "spec_ngram_mod_n_min") { // modern: chain continues
...
} else if (... "draft_override_tensor") {
...
} // closes last branch
#endif
} // closes for-loop
Brace count is now balanced under both preprocessor branches (verified
with `tr -cd '{' | wc -c` against the patched and unpatched outputs).
Local `make docker-build-turboquant` builds the linux/amd64 cpu-llama-cpp
`turboquant-avx` variant cleanly.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ci): forward AMDGPU_TARGETS into Dockerfile.turboquant builder-prebuilt
Dockerfile.turboquant's `builder-prebuilt` stage was missing the
`ARG AMDGPU_TARGETS` / `ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}` pair that
`builder-fromsource` already has (and that `Dockerfile.llama-cpp`
mirrors across both stages). When CI uses the prebuilt base image
(quay.io/go-skynet/ci-cache:base-grpc-*, the common path) the build-arg
passed by the workflow never reaches the env inside the compile stage.
backend/cpp/llama-cpp/Makefile:38 (introduced by #9626) errors out on
hipblas builds when AMDGPU_TARGETS is empty, and the turboquant
Makefile reuses backend/cpp/llama-cpp via a sibling build dir, so the
same check fires from turboquant-fallback under BUILD_TYPE=hipblas:
Makefile:38: *** AMDGPU_TARGETS is empty — set it to a comma-separated
list of gfx targets e.g. gfx1100,gfx1101. Stop.
make: *** [Makefile:66: turboquant-fallback] Error 2
The bug is latent on master because the docker layer cache stays warm
across builds — the compile step rarely re-runs from scratch. The
llama.cpp bump in this PR invalidates the cache, so the missing env var
becomes load-bearing and the hipblas turboquant CI job fails.
Mirror the existing pattern from Dockerfile.llama-cpp.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* ci: close the GC race + cascade-skip + darwin grpc gaps from v4.2.1
v4.2.1's backend.yml run (#25701862853) exposed three independent issues
on top of the singletons fix shipped in ea001995. Address all three plus
two related cleanups:
1. quay GC race in backend-merge-jobs-multiarch (12/37 merges failed with
"manifest not found"). Even after PR #9746 split multi/single-arch
merges, the multiarch matrix itself takes ~2h to drain at
max-parallel: 8, and the earliest per-arch digests (push-by-digest,
no tag) get reaped by quay's GC before the merge runs. The split
bounded the race for multiarch; it doesn't eliminate it. Anchor each
per-arch digest immediately to a tag in the internal ci-cache image
(`keepalive-<run_id><tag-suffix>-<platform-tag>`). Quay won't GC
tagged manifests. backend_merge.yml deletes the keepalive tags via
quay REST API after publishing the user-facing manifest list.
Cleanup is best-effort: if the quay token is not OAuth-scoped the
merge does NOT fail, the orphan tags just persist.
2. cascade-skip on backend-merge-jobs-singlearch. v4.2.1 had 2 failed
and 2 cancelled singlearch builds (out of 199); GHA's default
`needs:` semantics cascade-skipped the entire singlearch merge
matrix, so zero singleton tags were applied even though 197
singletons built successfully. Wrap the merge `if:` in
`!cancelled() && ...` for both multi and single arch in backend.yml
and backend_pr.yml so partial build failures publish the successful
tag-suffixes.
3. Darwin llama-cpp grpc-server build fails with `find_package(absl)`
not found. Same shape as the ccache/blake3/fmt/hiredis/xxhash/zstd
fix already in `Dependencies`: a brew cache hit restores
`/opt/homebrew/Cellar/grpc` so `brew install grpc` no-ops, but
abseil isn't in our Cellar cache list and never gets installed
alongside, leaving grpc's CMake unable to resolve it. Mirror the
`brew reinstall ccache` line with `brew reinstall grpc` to
re-validate grpc's full transitive dep closure on every cache-hit
run.
4. Move the four heaviest CUDA cpp builds back to bigger-runner. v4.2.1
wall-clock: -gpu-nvidia-cuda-12-llama-cpp 5h36m,
-gpu-nvidia-cuda-12-turboquant 6h05m,
-gpu-nvidia-cuda-13-llama-cpp 5h37m,
-gpu-nvidia-cuda-13-turboquant 6h05m. The cuda-12 turboquant and
cuda-13 turboquant entries are over GHA's 6h job timeout. Phase 5.3
of the free-tier migration (PR #9730) had explicitly flagged this
batch as 'highest-risk' with a per-entry revert path. All other
matrix entries (vulkan-llama-cpp ~47m, ROCm hipblas-llama-cpp ~2h,
intel sycl-f32 ~1h49m) stay on free-tier ubuntu-latest.
Verified locally: all six edited workflow YAMLs parse cleanly. Real
verification has to come from the next tag release run.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* ci: extract keepalive anchor + cleanup into .github/scripts/
The two inline shell blocks from the previous commit are long enough to
hurt readability of the workflow YAML and benefit from their own files
with self-contained docs. Move them to .github/scripts/:
anchor-digest-in-cache.sh backend_build.yml's keepalive anchor
cleanup-keepalive-tags.sh backend_merge.yml's best-effort cleanup
Workflow steps reduce to a single `run:` invocation each, with all the
parameter plumbing handled by env vars on the step. backend_merge.yml
also gains a sparse `actions/checkout@v6` step (sparse to .github/scripts
only) so the cleanup script is available on the runner — backend_build
already checks out for the docker build.
Net workflow diff: -36 lines across the two files. Script logic and
behavior are byte-identical to the inline version.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Ollama's embedding endpoint accepts both `input` and `prompt` as the
input string value (see ollama/ollama docs/api.md#generate-embeddings).
LocalAI only accepted `input`, which broke client libraries that send
the `prompt` form.
Add `Prompt` to OllamaEmbedRequest and have GetInputStrings fall back
to it when Input is unset. Input still wins when both are provided.
Fixes #9767.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
* fix: parse vulkan VRAM from text
Assisted-by: opencode:gpt-5.5
Signed-off-by: Andreas Egli <github@kharan.ch>
* fix: replace string.split with streaming iteration
Assisted-by: Opencode:Gemma4
Signed-off-by: Andreas Egli <github@kharan.ch>
---------
Signed-off-by: Andreas Egli <github@kharan.ch>
@@ -112,6 +112,8 @@ Add a YAML anchor definition in the `## metas` section (around line 2-300). Look
Add image entries at the end of the file, following the pattern of similar backends such as `diffusers` or `chatterbox`. Include both `latest` (production) and `master` (development) tags.
**Note on integrity:** OCI backends installed from a gallery whose `verification:` block is set are verified against a keyless-cosign policy before extraction; tarball/HTTP backends use the optional `sha256:` field. New backends do not need any extra YAML — the gallery-level `verification:` block covers every entry. See [.agents/backend-signing.md](backend-signing.md) for the producer-side CI step.
## 4. Update the Makefile
The Makefile needs to be updated in several places to support building and testing the new backend:
@@ -284,7 +284,17 @@ Also bump the expected-length count in `api_instructions_test.go` and add the na
### 3. `capabilities.js` symbol (for new model-config FLAG_* flags)
If your feature needs a new `FLAG_*` usecase flag in `core/config/model_config.go` (so users can filter gallery models by it, and so `/v1/models` surfaces it), you need to update **all** of:
- `Usecase<Name>` string constant in `core/config/backend_capabilities.go`
- `UsecaseInfoMap` entry mapping the string to its flag + gRPC method
- `FLAG_<NAME>` bitmask in `core/config/model_config.go`
- `GetAllModelConfigUsecases()` map entry (otherwise the YAML loader silently ignores the string)
- `ModalityGroups` membership if the flag should affect `IsMultimodal()` (e.g. realtime_audio is in both speech-input and audio-output groups so a lone flag still reads as multimodal)
- `GuessUsecases()` branch listing the backends that own this capability
- `usecaseFilters` in `core/http/routes/ui_api.go` (drives the gallery filter dropdown)
- `Models.jsx` `FILTERS` array + matching `filters.<camelCase>` i18n key in `core/http/react-ui/public/locales/en/models.json`
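To show why the bitmask and the name map both need updating, here is a small self-contained sketch. Every identifier in it (`usecase`, `flagChat`, `allUsecases`, `parseUsecases`) is hypothetical and much simpler than the real definitions in `core/config`; it only illustrates how a flag that is missing from the name-to-flag map cannot be selected from YAML.

```go
package main

import "fmt"

// usecase is a bitmask type; each capability owns one bit.
type usecase uint32

const (
	flagChat usecase = 1 << iota
	flagTTS
	flagRealtimeAudio // stands in for a newly added flag
)

// allUsecases maps the string used in YAML to its flag. A flag that is not
// listed here cannot be selected by name.
var allUsecases = map[string]usecase{
	"chat":           flagChat,
	"tts":            flagTTS,
	"realtime_audio": flagRealtimeAudio,
}

// parseUsecases ORs together the flags for the given names and returns the
// names it did not recognise, so the caller can decide whether to warn.
func parseUsecases(names []string) (usecase, []string) {
	var mask usecase
	var unknown []string
	for _, n := range names {
		f, ok := allUsecases[n]
		if !ok {
			unknown = append(unknown, n)
			continue
		}
		mask |= f
	}
	return mask, unknown
}

// has reports whether every bit of f is set in u.
func (u usecase) has(f usecase) bool { return u&f == f }

func main() {
	mask, unknown := parseUsecases([]string{"chat", "realtime_audio", "typo"})
	fmt.Println(mask.has(flagRealtimeAudio), mask.has(flagTTS), unknown)
}
```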
@@ -15,3 +15,35 @@ Let's say the user wants to build a particular backend for a given platform. For
- Unless the user asks you to run the command, just print it, because not all agent frontends handle long-running jobs well and the output may overflow your context.
- The user may say they want to build AMD or ROCM instead of hipblas, Intel instead of SYCL, or NVIDIA instead of l4t or cublas. Ask for confirmation if there is ambiguity.
- Sometimes the user may need extra parameters to be added to `docker build` (e.g. `--platform` for cross-platform builds or `--progress` to view the full logs), in which case you can generate the `docker build` command directly.
## Test coverage gate
The core Go suites (`./pkg`, `./core`, plus the in-process integration suite `./tests/e2e`) are covered by a **strict, monotonic coverage ratchet**:
- `make test-coverage`: runs the suites with `covermode=atomic` instrumentation and writes a merged profile to `coverage/coverage.out`. Uses the same prerequisites as `make test`.
- **`--coverpkg` (`COVERAGE_COVERPKG = core/...,pkg/...`):** coverage is attributed to the core+pkg packages, not just the package under test. This is what lets the in-process `tests/e2e` suite (which drives the real HTTP server over loopback via `application.New`) credit the `core/http/endpoints/...` handlers it exercises — folding it in roughly doubled endpoint coverage (e.g. `endpoints/openai` 13.6% → 52%). The denominator is therefore *all* of `core`+`pkg` (minus generated proto, dropped via `COVERAGE_EXCLUDE_RE`), so the number isn't comparable to a plain per-package figure.
- **Integration suites (`COVERAGE_E2E_ROOTS = ./tests/e2e`)** run non-recursively (excludes `tests/e2e/distributed`, which needs containers) with `--label-filter=!real-models` (those need a downloaded model) against the mock backend built by `prepare-test`. `tests/integration` is deliberately excluded — it needs `make backends/local-store`, which the coverage CI job doesn't build.
- **Flake note:** folding integration tests into a *strict* gate means a hard e2e failure (or a spec that silently stops running) can fail the coverage gate, not just the test. `--flake-attempts` absorbs transient retryable failures; covermode=atomic keeps line coverage deterministic otherwise.
- **Why one ginkgo run per root (`scripts/run-coverage.sh`):** passing several recursive roots to a *single* ginkgo invocation (e.g. `ginkgo -r ./pkg ./core`) only merges **one** root's coverprofile into `--output-dir`/`--coverprofile` — the others are silently dropped. Verified with ginkgo 2.29.0: `-r ./pkg ./core` yields only `./pkg` coverage, while `-r ./core` alone yields all 34 core packages. So the script runs each root separately and concatenates the (disjoint) profiles. Don't "simplify" it back to a single multi-root invocation — that's how `core/` (including all of `core/http`, ~7.4k statements) silently vanished from the number before.
- **Build tags (`COVERAGE_TAGS`, passed via `GINKGO_TAGS`):** defaults to `debug auth`. The `auth` tag is required to compile the real (sqlite-backed) auth implementation and its ~150 `//go:build auth` tests — without it those files aren't built, the tests don't run, and the gate scores auth against a stub (~3.7% instead of ~38%). If you add new tag-gated tests, extend `COVERAGE_TAGS` or they won't count (and likely won't run in CI at all).
- `make test-coverage-check`: runs `test-coverage`, then `scripts/coverage-check.sh` fails the build if total coverage is **below** the committed baseline in `coverage-baseline.txt`. The Linux job in `.github/workflows/test.yml` runs this instead of `make test`.
- `make test-coverage-baseline`: regenerates and overwrites `coverage-baseline.txt` from the current run.
- `make install-hooks`: sets `core.hooksPath` to the versioned `.githooks/`, whose `pre-commit` runs checks scoped to what's staged: Go changes → `make lint` + `make test-coverage-check`; `core/http/react-ui/` changes → `make test-ui-coverage-check` (Playwright e2e + UI coverage gate). A commit touching neither is skipped; bypass with `git commit --no-verify`. The hook resolves golangci-lint's new-from base to `upstream/master` → `origin/master` → `master`, so it works from a fork clone where `origin/master` is stale (passed to `make lint` via `LINT_NEW_FROM`).
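The "one run per root, then concatenate" step can be sketched in Go as below. This is an illustration of the idea only: the real work is done by the shell script `scripts/run-coverage.sh`, and the `mergeProfiles` function is a hypothetical helper. It relies on the Go cover profile format, where the first line is a `mode:` header and every later line describes one block.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// mergeProfiles concatenates Go cover profiles, keeping a single "mode:"
// header. It assumes each profile starts with its header and that all
// profiles were produced with the same covermode.
func mergeProfiles(profiles ...string) (string, error) {
	var out strings.Builder
	mode := ""
	for _, p := range profiles {
		sc := bufio.NewScanner(strings.NewReader(p))
		for sc.Scan() {
			line := sc.Text()
			if strings.HasPrefix(line, "mode:") {
				if mode == "" {
					mode = line
					out.WriteString(line + "\n")
				} else if line != mode {
					return "", fmt.Errorf("mixed cover modes: %q vs %q", mode, line)
				}
				continue
			}
			if line != "" {
				out.WriteString(line + "\n")
			}
		}
		if err := sc.Err(); err != nil {
			return "", err
		}
	}
	return out.String(), nil
}

func main() {
	pkgProfile := "mode: atomic\npkg/a.go:1.1,2.2 1 1\n"
	coreProfile := "mode: atomic\ncore/b.go:3.1,4.2 2 0\n"
	merged, err := mergeProfiles(pkgProfile, coreProfile)
	if err != nil {
		panic(err)
	}
	fmt.Print(merged)
}
```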
### React UI coverage
The React UI (`core/http/react-ui/`) has **no component/unit tests** — its only tests are the Playwright e2e specs in `e2e/`, which run against the real app served by `tests/e2e-ui/ui-test-server` (the dist is `//go:embed`ed, so the server is rebuilt per coverage run). Those specs do genuinely exercise the UI (clicks, `fill`, `setInputFiles`, `getByRole`/`getByText`, visibility/value assertions).
- `make test-ui-coverage`: builds an istanbul-instrumented bundle (`COVERAGE=true`, via `vite-plugin-istanbul` with `forceBuildInstrument: true`, because the plugin skips production builds otherwise), re-embeds it into `ui-test-server`, runs the Playwright specs, and writes an `nyc` report to `core/http/react-ui/coverage/`. The specs import `{ test, expect }` from `e2e/coverage-fixtures.js`, which re-exports Playwright's versions and harvests `window.__coverage__` into `.nyc_output/` after each test. Instrumentation is off unless `COVERAGE=true`, so dev/prod builds and plain `make test-ui-e2e` are unaffected (the fixture no-ops when `window.__coverage__` is absent).
- **Browser:** the flake dev shell ships `chromium` and exports `PLAYWRIGHT_CHROMIUM_PATH`; `playwright.config.js` uses it via `launchOptions.executablePath`, and the Makefile skips `playwright install` when it's set. This avoids Playwright's downloaded browser, which can't resolve system libs (`libglib-2.0`, …) on NixOS. In CI (no `PLAYWRIGHT_CHROMIUM_PATH`) the Makefile falls back to `playwright install --with-deps chromium`.
- The app is a React SPA, so coverage accumulates across in-app navigation within a test; a full `page.goto`/reload resets it.
- `.nycrc.json` uses `all: true`, so **every `src/**` file is in the report**, including 0%-coverage ones. That's how you spot features with no test at all (sort the HTML report or `coverage-summary.json` by line% ascending).
- **UI coverage gate:** `make test-ui-coverage-check` runs the suite then `scripts/ui-coverage-check.sh`, failing if total line coverage drops more than `UI_COVERAGE_TOLERANCE` below `core/http/react-ui/coverage-baseline.txt`. `make test-ui-coverage-baseline` regenerates the baseline. Runs in CI (`tests-ui-e2e.yml`) and pre-commit on `core/http/react-ui/` changes.
- **Why it has a tolerance (unlike the strict Go gate):** UI e2e coverage is *non-deterministic*. Specs that assert on state and end while async/lazy render work is still in flight collect those lines only when the render beats the coverage teardown — so the total drifts with machine speed/load (a fast local box reads higher than a slow CI runner), diffusely across many specs. The tolerance absorbs that drift, so set the baseline *below* the slow-CI floor, never to a fast-local `make test-ui-coverage-baseline` number, or CI flaps.
- **Raising coverage is cheap:** a *render-smoke* spec (navigate to a route, assert its header renders) mounts a lazy page and runs its full render + initial effects, capturing most of its lines in a few lines of test — see `e2e/page-render-smoke.spec.js`. Auth is disabled in the test server (`isAdmin=true`), so `RequireAdmin`/`RequireFeature` routes render without a mock. The most *deterministic* win is removing a race: make a spec `await` a rendered element before ending (see `e2e/agents.spec.js` → AgentCreate) so its lines count every run.
Rules (both gates):
- **Install the hooks:** `make install-hooks` once per clone so lint + coverage run pre-commit. Don't lean on CI for what the hook catches.
- **Don't work around the gate:** never `git commit --no-verify`, and never hand-lower a baseline or widen a tolerance to turn a red gate green. The ratchet only moves up.
- If a change drops coverage, **add tests** (sort `coverage-summary.json` by line% ascending to find untested code) rather than editing the baseline. When coverage legitimately rises, commit the regenerated baseline (`make test-coverage-baseline` / `test-ui-coverage-baseline`).
- The Go gate is **strict — no tolerance**; `covermode=atomic` keeps it deterministic. The UI gate keeps a small tolerance only because its e2e coverage isn't.
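The difference between the two gates comes down to one comparison, sketched below. The `checkGate` function is hypothetical (the real checks live in `scripts/coverage-check.sh` and `scripts/ui-coverage-check.sh`), and treating the tolerance as percentage points is an assumption.

```go
package main

import "fmt"

// checkGate returns an error when current coverage falls more than tolerance
// (assumed to be percentage points) below the baseline. A tolerance of 0
// gives the strict behaviour described for the Go gate.
func checkGate(current, baseline, tolerance float64) error {
	if current < baseline-tolerance {
		return fmt.Errorf("coverage %.2f%% is below baseline %.2f%% (tolerance %.2f)",
			current, baseline, tolerance)
	}
	return nil
}

func main() {
	// Strict gate: any drop below the baseline fails.
	fmt.Println(checkGate(49.9, 50.0, 0))
	// Tolerant gate: the same drop passes with half a point of slack.
	fmt.Println(checkGate(49.9, 50.0, 0.5))
}
```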
@@ -50,6 +50,17 @@ Do not mix styles within a package. If you are extending tests in a package that
This is enforced by `golangci-lint` via the `forbidigo` linter (see `.golangci.yml`); calls like `t.Errorf` / `t.Fatalf` / `t.Run` / `t.Skip` / `t.Logf` are flagged. Run `make lint` locally before submitting; the same check runs in CI (`.github/workflows/lint.yml`).
## Outbound HTTP
All outbound HTTP must go through `github.com/mudler/LocalAI/pkg/httpclient` rather than the standard library's default client. Use `httpclient.New(...)` (no body deadline — safe for streaming/SSE) or `httpclient.NewWithTimeout(d, ...)` (simple request/response). Both **refuse redirects by default** and set a TLS 1.2 floor.
The reason is GHSA-3mj3-57v2-4636: the std default client follows redirects, and on a *cross-host* redirect Go forwards custom credential headers (e.g. Anthropic's `x-api-key`) to the redirect target, leaking the secret. `httpclient` fails closed instead.
- Need to follow redirects (download CDNs, registry blobs, GitHub asset URLs)? Pass `httpclient.WithFollowRedirects()` — it still strips credential headers on any cross-host hop.
- Have a custom transport (IP-pinned dialer, HTTP/2 tuning, a credential-injecting `RoundTripper`)? Pass `httpclient.WithTransport(rt)`, basing the transport on `httpclient.HardenedTransport()` to keep the TLS floor. Handed a `*http.Client` by a library? `httpclient.Harden(c)` applies the policy in place.
This is enforced by `forbidigo` (see `.golangci.yml`): `http.DefaultClient` and `http.Get`/`Post`/`PostForm`/`Head` are flagged. The `&http.Client{}` composite literal can't be matched precisely by forbidigo without also flagging legitimate `*http.Client` type references, so that form is caught by review — don't construct raw clients.
## Documentation
The project documentation is located in `docs/content`. When adding new features or changing existing functionality, it is crucial to update the documentation to reflect these changes. This helps users understand how to use the new capabilities and ensures the documentation stays relevant.
@@ -61,6 +61,12 @@ Always check `llama.cpp` for new model configuration options that should be supp
- `reasoning_format` - Reasoning format options
- Any new flags or parameters
### Speculative Decoding Types
The `spec_type` option in `grpc-server.cpp` delegates to upstream's `common_speculative_types_from_names()`, so new speculative types added to the `common_speculative_type_from_name` map in `common/speculative.cpp` are picked up automatically with no code changes - only docs need an entry in `docs/content/advanced/model-configuration.md`. Current values: `none`, `draft-simple`, `draft-eagle3`, `draft-mtp`, `ngram-simple`, `ngram-map-k`, `ngram-map-k4v`, `ngram-mod`, `ngram-cache`.
`draft-mtp` (Multi-Token Prediction, [ggml-org/llama.cpp#22673](https://github.com/ggml-org/llama.cpp/pull/22673)) does not need a separate draft GGUF: when `spec_type` includes `draft-mtp` and `draftmodel` is empty, the upstream server creates an MTP context off the target model itself. LocalAI's gRPC layer needs no changes for this — it works through the existing `params.speculative.types` plumbing and the derived `cparams.n_rs_seq = params.speculative.need_n_rs_seq()` in `common_context_params_to_llama`.
### Implementation Guidelines
1. **Feature Parity**: Always aim for feature parity with llama.cpp's implementation
stale-issue-message: 'This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days.'
stale-pr-message: 'This PR is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 10 days.'
          msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.Fail. See .agents/coding-style.md.'
        - pattern: '^t\.FailNow$'
          msg: 'LocalAI tests must use Ginkgo/Gomega; use Fail(...) instead of t.FailNow. See .agents/coding-style.md.'
        # In-process config should flow through ApplicationConfig / kong-bound
        # CLI flags, not via os.Getenv. The CLI layer is the legitimate
        # env→struct boundary (kong's `env:"..."` tag); anything deeper that
        # reads env directly leaks process state into business logic and
        # makes flags impossible to test or override per-request. Backend
        # subprocesses, the system/capabilities probe, and a few places that
        # read non-LocalAI env vars (HOME, PATH, AUTH_TOKEN passed by parent)
        # are exempt — see linters.exclusions.rules below.
        - pattern: '^os\.(Getenv|LookupEnv|Environ)$'
          msg: 'Plumb config through ApplicationConfig (or the relevant CLI struct) instead of reading env directly. CLI entry points (core/cli/) bind env vars via kong''s `env:` tag — that is the only sanctioned env→struct boundary. See .agents/coding-style.md.'
        # Outbound HTTP must go through pkg/httpclient, which refuses redirects
        # by default and sets a TLS floor. The std-library default client and
        # the http.Get/Post/... convenience helpers follow redirects (up to 10)
        # and, on a cross-host redirect, forward custom credential headers such
        # as Anthropic's x-api-key to the redirect target — leaking the secret
        # (GHSA-3mj3-57v2-4636). forbidigo can't precisely match the
        # `&http.Client{}` composite literal without also flagging legitimate
        # `*http.Client` type references, so that form is enforced by
        # convention + review; these two patterns catch the implicit-default
        # client, which is the common footgun.
        - pattern: '^http\.DefaultClient$'
          msg: 'Use pkg/httpclient (httpclient.New / NewWithTimeout) instead of http.DefaultClient — the std client follows redirects and leaks credential headers cross-host (GHSA-3mj3-57v2-4636). See .agents/coding-style.md.'
        - pattern: '^http\.(Get|Post|PostForm|Head)$'
          msg: 'Use pkg/httpclient (httpclient.New / NewWithTimeout) instead of http.Get/Post/PostForm/Head — these use http.DefaultClient, which follows redirects and leaks credential headers cross-host (GHSA-3mj3-57v2-4636). See .agents/coding-style.md.'
  exclusions:
    paths:
      # Upstream whisper.cpp source tree fetched by the whisper backend Makefile.
      - 'backend/go/whisper/sources'
      - 'docs/'
    rules:
      # CLI entry points: kong's `env:"..."` tag is the legitimate env→struct
      # boundary, and a handful of subcommands legitimately propagate values
      # to spawned subprocesses (LLAMACPP_GRPC_SERVERS, MLX hostfile, ...).
      - path: ^core/cli/
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
      # Backend subprocesses are independent binaries with their own env
      # surface; they're not "in-process config" of the LocalAI server.
      - path: ^backend/
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
      # System capability probe reads HOME, PATH-style vars to discover
      # GPUs, default paths, etc. — not LocalAI config.
      - path: ^pkg/system/
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
      # gRPC server reads AUTH_TOKEN passed in by the parent process at spawn
      # time; model.Loader sets/inherits env to communicate with subprocesses.
      - path: ^pkg/grpc/
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
      - path: ^pkg/model/
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
      # Top-level main binaries (local-ai, launcher) are entry points.
      - path: ^cmd/
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
      # Tests legitimately read $HOME, $TMPDIR, and gating env vars
      # (LOCALAI_COSIGN_LIVE, etc.) to skip live-network specs.
      - path: _test\.go$
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
# pkg/httpclient is the sanctioned home for outbound HTTP clients; it
- **Git hooks & coverage gates**: Run `make install-hooks` once per clone so the pre-commit lint + coverage gates run. **Never bypass them with `git commit --no-verify`, and never lower a coverage baseline or widen a gate's tolerance to turn a red gate green** — the coverage ratchet only moves up. If a change drops coverage, add tests to raise it (e.g. render-smoke specs). See [.agents/building-and-testing.md](.agents/building-and-testing.md).
- **Logging**: Use `github.com/mudler/xlog` (same API as slog)
@@ -198,6 +198,7 @@ For AI-assisted development, see [`AGENTS.md`](AGENTS.md) (or the equivalent [`C
- Prefer modern Go idioms — for example, use `any` instead of `interface{}`.
- Use [`golangci-lint`](https://golangci-lint.run) to catch common issues before submitting a PR.
- Run `make install-hooks` once per clone to enable the pre-commit hook: Go changes run `make lint` + the coverage gate (`make test-coverage-check`); `core/http/react-ui/` changes run the Playwright e2e suite (`make test-ui`). Do not bypass the hook with `git commit --no-verify`.
- Use [`github.com/mudler/xlog`](https://github.com/mudler/xlog) for logging (same API as `slog`). Do not use `fmt.Println` or the standard `log` package for operational logging.
- Use tab indentation for Go files (as defined in `.editorconfig`).
@@ -265,6 +266,12 @@ The e2e tests run LocalAI in a Docker container and exercise the API:
make test-e2e
```
### React UI tests and coverage
The React UI (`core/http/react-ui/`) is covered by Playwright e2e specs, gated by a **monotonic line-coverage ratchet** (`make test-ui-coverage-check`, run in CI and pre-commit). The metric is non-deterministic — a fast local box reads higher than a slow CI runner for the same code — so a small tolerance is unavoidable.
**If your change lowers UI coverage, raise it back by adding specs — do not widen the tolerance or hand-lower the baseline.** A *render-smoke* spec (navigate to a page, assert its header is visible) cheaply covers an entire lazy page. See `core/http/react-ui/e2e/page-render-smoke.spec.js` and the full policy in [.agents/building-and-testing.md](.agents/building-and-testing.md#react-ui-coverage).
### Running E2E container tests
These tests build a standard LocalAI Docker image and run it with pre-configured model configs to verify that most endpoints work correctly:
- **Any hardware** — NVIDIA, AMD, Intel, Apple Silicon, Vulkan, or CPU-only
- **Multi-user ready** — API key auth, user quotas, role-based access
- **Built-in AI agents** — autonomous agents with tool use, RAG, MCP, and skills
- **Privacy-first** — your data never leaves your infrastructure
**A small core, not a bundle.** Each backend wraps a best-in-class engine (llama.cpp, vLLM, whisper.cpp, stable-diffusion, MLX...) in its own image, pulled only when a model needs it. You install nothing you don't use.
- **Composable by design**: backends are separate and pulled on demand, so you install only what your model needs
- **Open and extensible**: load any model, or build your own backend in any language against an open interface
- **Drop-in API compatibility**: OpenAI, Anthropic, and ElevenLabs APIs across every backend
- **Any model, any modality**: LLMs, vision, voice, image, and video behind one API
- **Any hardware**: NVIDIA, AMD, Intel, Apple Silicon, Vulkan, or CPU-only
- **Multi-user ready**: API key auth, user quotas, role-based access
- **Built-in AI agents**: autonomous agents with tool use, RAG, MCP, and skills
- **Privacy-first**: your data never leaves your infrastructure

Created by [Ettore Di Giacinto](https://github.com/mudler) and maintained by the [LocalAI team](#team).
@@ -149,8 +155,10 @@ For more details, see the [Getting Started guide](https://localai.io/basics/gett
## Latest News
- **April 2026**: [Voice recognition](https://github.com/mudler/LocalAI/pull/9500), [Face recognition, identification & liveness detection](https://github.com/mudler/LocalAI/pull/9480), [Ollama API compatibility](https://github.com/mudler/LocalAI/pull/9284), [Video generation in stable-diffusion.ggml](https://github.com/mudler/LocalAI/pull/9420), [Backend versioning with auto-upgrade](https://github.com/mudler/LocalAI/pull/9315), [Pin models & load-on-demand toggle](https://github.com/mudler/LocalAI/pull/9309), [Universal model importer](https://github.com/mudler/LocalAI/pull/9466), new backends: [sglang](https://github.com/mudler/LocalAI/pull/9359), [ik-llama-cpp](https://github.com/mudler/LocalAI/pull/9326), [TurboQuant](https://github.com/mudler/LocalAI/pull/9355), [sam.cpp](https://github.com/mudler/LocalAI/pull/9288), [Kokoros](https://github.com/mudler/LocalAI/pull/9212), [qwen3tts.cpp](https://github.com/mudler/LocalAI/pull/9316), [tinygrad multimodal](https://github.com/mudler/LocalAI/pull/9364)
- **March 2026**: [Agent management](https://github.com/mudler/LocalAI/pull/8820), [New React UI](https://github.com/mudler/LocalAI/pull/8772), [WebRTC](https://github.com/mudler/LocalAI/pull/8790), [MLX-distributed via P2P and RDMA](https://github.com/mudler/LocalAI/pull/8801), [MCP Apps, MCP Client-side](https://github.com/mudler/LocalAI/pull/8947)
- **May 2026**: **LocalAI 4.3.0** - `llama.cpp` [prompt cache on by default](https://github.com/mudler/LocalAI/pull/9925) (repeated system prompts collapse from minutes to seconds), [keyless cosign signing of backend OCI images](https://github.com/mudler/LocalAI/pull/9823), [per-API-key + per-user usage attribution](https://github.com/mudler/LocalAI/pull/9920), Distributed v3 with [per-request replica routing](https://github.com/mudler/LocalAI/pull/9968). [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.3.0)
- **May 2026**: **LocalAI 4.2.0** - LocalAI sees and hears: [voice recognition](https://github.com/mudler/LocalAI/pull/9500), [face recognition + antispoofing liveness](https://github.com/mudler/LocalAI/pull/9480), speaker diarization. Plus [drop-in Ollama API](https://github.com/mudler/LocalAI/pull/9284), [video generation](https://github.com/mudler/LocalAI/pull/9420), redesigned UI with i18n + admin-configurable branding, vLLM at feature parity with llama.cpp, and 11 new backends. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.2.0)
- **April 2026**: **LocalAI 4.1.0** - LocalAI becomes a control tower: distributed cluster mode with VRAM-aware smart routing + autoscaling, multi-user platform with OIDC and API keys, per-user quotas with predictive analytics, in-UI fine-tuning with TRL (auto-export to GGUF), on-the-fly quantization backend, visual pipeline editor. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.1.0)
- **March 2026**: **LocalAI 4.0.0** - native agentic orchestration with the new [Agenthub](https://agenthub.localai.io) community hub, full React UI rewrite with Canvas mode, [MCP Apps + client-side](https://github.com/mudler/LocalAI/pull/8947) with tool streaming, [WebRTC realtime audio](https://github.com/mudler/LocalAI/pull/8790), [MLX-distributed](https://github.com/mudler/LocalAI/pull/8801). [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.0.0)
- **February 2026**: [Realtime API for audio-to-audio with tool calling](https://github.com/mudler/LocalAI/pull/6245), [ACE-Step 1.5 support](https://github.com/mudler/LocalAI/pull/8396)
- **January 2026**: **LocalAI 3.10.0** — Anthropic API support, Open Responses API, video & image generation (LTX-2), unified GPU backends, tool streaming, Moonshine, Pocket-TTS. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v3.10.0)
A special thanks to individual sponsors; a full list is on [GitHub](https://github.com/sponsors/mudler) and [buymeacoffee](https://buymeacoffee.com/mudler). Special shout out to [drikster80](https://github.com/drikster80) for being generous. Thank you everyone!
	// Happens when CPP vector has not had any elements pushed to it
	if segsPtr == 0 {
		return pb.VADResponse{
			Segments: []*pb.VADSegment{},
		}, nil
	}
	// unsafeptr warning is caused by segsPtr being on the stack and therefore being subject to stack copying AFAICT
	// however the stack shouldn't have grown between setting segsPtr and now, also the memory pointed to is allocated by C++
	segs := unsafe.Slice((*float32)(unsafe.Pointer(segsPtr)), segsLen) //nolint:govet // segsPtr addresses C++-owned heap memory passed back through the cgo-free purego boundary; the uintptr->Pointer round-trip is intentional and the buffer outlives this read.
		return nil, fmt.Errorf("crispasr: synthesis failed (the loaded model may not be a supported TTS backend, or needs extra config e.g. orpheus SNAC codec)")
	}
	defer CppTTSFree(ptr)
	src := unsafe.Slice((*float32)(unsafe.Pointer(ptr)), int(n)) //nolint:govet // ptr addresses C-allocated PCM returned across the purego boundary; copied out immediately below, before tts_free.
	out := make([]float32, int(n)) // copy out of C memory before free
	copy(out, src)
	return out, nil
}

// setVoice applies a per-call speaker/voice override (best effort). CrispASR
// returns a negative code when the active backend can't honor the name; we log
// it rather than fail, so an unknown voice falls back to the default speaker.
func setVoice(voice string) {
	v := strings.TrimSpace(voice)
	if v == "" {
		return
	}
	if rc := CppTTSSetVoice(v); rc != 0 {
		fmt.Fprintf(os.Stderr, "crispasr: voice %q not applied by the active TTS backend (rc=%d); using default\n", v, rc)
	}
}

func (w *CrispASR) TTS(req *pb.TTSRequest) error {
	if req.Dst == "" {
		return fmt.Errorf("crispasr: TTS requires a destination path")
	}
	setVoice(req.Voice)
	pcm, err := w.synthesize(req.Text)
	if err != nil {
		return err
	}
	return writeWAV24k(req.Dst, pcm)
}
// TTSStream is the streaming counterpart to TTS. CrispASR has no progressive
// (native streaming) synth, so we synthesize the whole utterance, encode it to
// a 24 kHz WAV, and emit the encoded bytes as a single chunk. The gRPC server
// wrapper (pkg/grpc/server.go:TTSStream) ranges over the channel until it is
// closed, so this method owns the close - mirrors vibevoice-cpp's TTSStream.
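
The single-chunk, producer-owns-the-close pattern that the comment above describes can be sketched as follows. This is a minimal illustration under assumptions, not the backend's actual `TTSStream`: `streamSingleChunk` and its signature are hypothetical, and the byte slice stands in for an already encoded WAV.

```go
package main

import "fmt"

// streamSingleChunk is a hypothetical helper: it emits the fully encoded audio
// as one chunk and then closes the channel, so a consumer ranging over the
// channel terminates. The producer owns the close.
func streamSingleChunk(encoded []byte, out chan<- []byte) {
	defer close(out)
	if len(encoded) > 0 {
		out <- encoded
	}
}

func main() {
	out := make(chan []byte)
	// Placeholder bytes standing in for an encoded WAV file.
	go streamSingleChunk([]byte("RIFF-placeholder"), out)

	chunks := 0
	for b := range out { // ends once the producer closes the channel
		chunks++
		fmt.Printf("chunk %d: %d bytes\n", chunks, len(b))
	}
}
```

Deferring the close inside the producer means the consumer's `range` loop also ends when nothing is emitted, for example on empty output.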