LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-07-13 17:54:02 -04:00

Author	SHA1	Message	Date
LocalAI [bot]	e4c70fca7a	fix(streaming/tools): don't leak prefill-misclassified content as trailing reasoning chunk (#10000 ) When the C++ autoparser is in pure-content fallback mode (qwen3-4b after model emits a tool-call JSON in non-thinking mode, the streaming worker ended the SSE stream with a spurious data: {...,"delta":{"reasoning":"{\"name\":\"exec\",\"arguments\":...}"}} chunk carrying the same JSON that was already in delta.tool_calls. The Go-side ReasoningExtractor is configured from DetectThinkingStartToken, which scans the model's jinja chat template verbatim and finds <think> inside an {% if enable_thinking %} block without evaluating the conditional. Every output chunk then runs through PrependThinkingTokenIfNeeded, which synthesizes a <think> in front and makes ExtractReasoning treat everything after as reasoning. The autoparser correctly classifies zero reasoning (qwen3's tool format isn't on llama.cpp's recognized-tool list, so all tokens land in ChatDelta.Content), but processStreamWithTools then preferred extractor.Reasoning() over functions.ReasoningFromChatDeltas at the end-of-stream flush — handing the polluted Go-side state to buildDeferredToolCallChunks, which emitted it as a trailing reasoning chunk. Two changes: * Add a sticky preferAutoparser flag to processStreamWithTools, mirroring the analogous flag in processStream from #9985. Once any ChatDelta carries content or reasoning, the flag stays on for the rest of the stream and the worker stops falling back to the Go-side extractor for per-token deltas. This avoids the per-chunk leak path and the cumulative pollution. * Extract chooseDeferredReasoning, a small helper that selects the end-of-stream reasoning source. When preferAutoparser is set, return functions.ReasoningFromChatDeltas(chatDeltas); otherwise fall back to extractor.Reasoning() (the correct source for vLLM and other backends with no autoparser). The helper has a focused test suite covering both sides of the contract: autoparser-active with empty reasoning (the qwen3 case — the fix's purpose), autoparser-active with real reasoning_content (jinja-with-recognized-format models), and autoparser-not-active with genuine Go-side reasoning (vLLM-style backends). E2E with combined #9988 and this fix on qwen3-4b post-#9985 gallery shape: 18 content chunks of the tool-call JSON, 1 tool_call chunk with name='exec' and the right arguments, finish_reason=tool_calls, and zero reasoning chunks — down from one polluted reasoning chunk before this fix. Depends on #9999 (the streaming JSON tool-call gating bug for qwen3) to make the trailing chunk observable end-to-end; the helper unit tests are independent. Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-26 08:34:26 +02:00
dependabot[bot]	4b398c9798	chore(deps): bump github.com/nats-io/nats.go from 1.50.0 to 1.52.0 (#10003 ) Bumps [github.com/nats-io/nats.go](https://github.com/nats-io/nats.go) from 1.50.0 to 1.52.0. - [Release notes](https://github.com/nats-io/nats.go/releases) - [Commits](https://github.com/nats-io/nats.go/compare/v1.50.0...v1.52.0) --- updated-dependencies: - dependency-name: github.com/nats-io/nats.go dependency-version: 1.52.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-26 08:34:09 +02:00
LocalAI [bot]	4a5219fa9c	chore: ⬆️ Update ggml-org/whisper.cpp to `e0fd1f6787a5bd4a4957dd97c5b64df882ee7b0c` (#9997 ) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-26 08:33:53 +02:00
LocalAI [bot]	b5a620294e	chore: ⬆️ Update leejet/stable-diffusion.cpp to `1ceb5bd9df7784bcdf67dd9ed8bf0198b542ebc9` (#9994 ) ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-26 08:33:37 +02:00
LocalAI [bot]	5d544a7868	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `b4e1d916c5ec7e75ea3c124dd090425a99fc613f` (#9995 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-25 23:57:17 +02:00
LocalAI [bot]	87e01aa290	chore: ⬆️ Update antirez/ds4 to `ad0209f6a4b067574d2b4afe896c08c177156b31` (#9996 ) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-25 23:56:33 +02:00
LocalAI [bot]	f17d99f6e5	fix(streaming/tools): stop healing-marker stubs from gating off content (#9999 ) * fix(streaming/tools): stop healing-marker stubs from gating off content When the C++ autoparser is in pure-content fallback mode (e.g. qwen3 without --jinja) and the model emits a tool call as JSON, the streaming worker calls ParseJSONIterative on each new chunk. parseJSONWithStack heals partial input like `{` into `{"<marker>":1}` where <marker> is a random integer. removeHealingMarkerFromJSON only stripped the marker from values, so the synthetic key survived and downstream callers saw a stub object with a random-looking key. chat_stream_workers.go's JSON tool-call detector then bumped lastEmittedCount past the stub even though no real tool call was emitted, gating off ALL subsequent content chunks. The qwen3 + tools + streaming case ended up dribbling only the first `{"` to clients and then nothing, even when the model went on to call the noAction `answer({"message": "…"})` pseudo-tool. Three changes, each with its own regression test: * removeHealingMarkerFromJSON now strips the marker suffix from keys too, dropping the entry when the truncated key is empty. Inputs like `{` no longer leak `{"<marker>":1}` to callers; partial keys like `{ "code` still preserve the model-typed prefix `code`. * ParseJSONIterative skips empty-after-healing maps so a healed `{` doesn't surface as a stub result. * The streaming JSON detector now breaks (not continues) on entries without a usable `name`, and only bumps lastEmittedCount past successfully-emitted entries. Defense-in-depth against any future partial-parse shape. The parser tests cover eight partial-JSON-prefix shapes and verify no marker characters leak into keys, plus the two early shapes (`{`, `{"`) that should not surface a stub at all. Fixes #9988 Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(streaming/tools): cover the autoparser-correctly-working path Extract the JSON tool-call streaming emit loop into emitJSONToolCallDeltas and unit-test it against every shape that can hit the streaming worker: * the bug case — a healing-marker stub at index 0 must NOT bump lastEmittedCount, so subsequent content chunks keep flowing; * the autoparser-correctly-working case — empty jsonResults (because the C++ autoparser cleared the raw text and delivers tool calls via TokenUsage.ChatDeltas) is a no-op, leaving the deferred end-of-stream emitter to ship the autoparser's tool calls; * a single complete tool call — emit one chunk, advance to 1; * arguments arriving as a JSON-string vs as a nested object — both serialize to the wire as JSON-string arguments; * multiple parallel tool calls — one chunk each; * a real tool call followed by a partial stub — emit the real one, stop at the stub, resume on a later chunk once the stub completes. Locks down the no-regression guarantee the user asked for: this PR's fix is scoped to the pure-content fallback path; when the autoparser actually classifies tool calls (jinja-recognized chat format with tool support), the helper is a no-op and nothing changes. Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-25 23:55:35 +02:00
LocalAI [bot]	597daa925b	docs: ⬆️ update docs version mudler/LocalAI (#9993 ) ⬆️ Update docs version mudler/LocalAI Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-25 22:40:52 +02:00
LocalAI [bot]	3e0612b8b4	feat(swagger): update swagger (#9992 ) Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-25 22:40:32 +02:00
LocalAI [bot]	de2ce74bea	fix(stablediffusion-ggml): mux LTX-2 audio into output MP4 (#9990 ) feat(stablediffusion-ggml): mux LTX-2 audio into output MP4 sd.cpp's generate_video now returns a sd_audio_t* alongside the video frames for models with an audio VAE (LTX-2.3). Our gosd wrapper was already collecting that pointer but immediately freed it without ever muxing it into the output, so LTX-2 generations landed as silent MP4s even though the audio VAE decode succeeded. Stage the planar float32 waveform to a temp WAV (IEEE float, header hand-built; samples interleaved on the fly), then add it as a second ffmpeg input with -c:a aac -map 0:v:0 -map 1:a:0 -shortest. The temp WAV is cleaned up unconditionally after ffmpeg exits, including on the write/waitpid error paths. Non-LTX models (Wan i2v / FLF2V) keep their current behaviour: audio arg is nullptr, the audio-related ffmpeg flags are not added, and no temp file is created. Assisted-by: Claude:claude-opus-4-7 Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-25 22:40:16 +02:00
LocalAI [bot]	1c6c3adad6	fix(reasoning): stop <think> leaking into content when autoparser is in pure-content mode (#9991 ) When LocalAI templates a thinking model outside of jinja (the default for the qwen3 gallery family), llama.cpp's chat parser falls back to a "pure content" PEG parser that dumps the entire raw response into ChatDelta.Content with an empty ReasoningContent. The Go side then trusted that content verbatim and overrode tokenCallback's correctly-split reasoning, so <think>...</think> blocks ended up in the OpenAI `content` field. Regression from v4.0.0 introduced when the autoparser ChatDeltas path was added (#9224). The override now runs Go-side reasoning extraction defensively when the autoparser delivered content but no reasoning. The streaming worker gains a sticky preferAutoparser flag that flips on the first chunk where the autoparser classified reasoning_content; until then we use the streaming Go-side extractor. Realtime mirrors the non-streaming fallback. When the autoparser already populated ReasoningContent we trust it untouched, so jinja-enabled installs are not regressed. gallery/qwen3.yaml now enables use_jinja, letting the autoparser classify <think> natively for all 20+ qwen3 family entries that share this template. Fixes #9985 Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-25 22:39:50 +02:00
Ettore Di Giacinto	c2cd3b9ada	fix(gallery/ltx-2.3): add vae_decode_only:false for i2v / flf2v (#9987 ) LTX-2.3 i2v inference fails inside generate_video with: [ERROR] LTXAV image conditioning requires VAE encoder weights; create the context with vae_decode_only=false Without vae_decode_only:false in the options block, gosd.cpp creates the sd_ctx with VAE encoder weights freed, so latent encoding of the init_image is impossible. Adding the option mirrors what we already do for Wan i2v entries. Affects all six LTX-2.3 entries (dev/distilled × UD-Q4_K_M, Q4_K_M, Q8_0). T2V wasn't impacted by the missing option since it has no init image to encode, which is why the T2V smoke earlier passed. Assisted-by: Claude:claude-opus-4-7	2026-05-25 21:40:12 +02:00
Ettore Di Giacinto	9ff270eb65	fix(gallery/ltx-2.3): add diffusion_model flag to all variants (#9986 ) LTX-2.3 entries (dev / distilled, UD-Q4_K_M / Q4_K_M / Q8_0) were missing the `diffusion_model` option in their overrides. Without it, gosd.cpp routes the main GGUF through the regular `model_path` code path in sd.cpp, which doesn't apply the `model.diffusion_model.` tensor prefix. sd.cpp's LTX-2.3 architecture detection (`VERSION_LTXAV`) in get_sd_version checks for prefixed tensor names — without the prefix, detection fails and load_model returns "could not load model". This is the same bug we hit for Wan when the option was missing. Adding `- diffusion_model` to all six LTX-2.3 entries' option blocks makes load_model take the diffusion_model_path branch so detection succeeds. Assisted-by: Claude:claude-opus-4-7	2026-05-25 21:10:40 +02:00
LocalAI [bot]	8d6548c0b9	fix(distributed): sync gallery OpCache + caches across frontend replicas (#9983 ) When the LocalAI frontend deployment is scaled past one replica, the UI's /api/operations poll round-robins between pods. Each pod kept the OpCache (galleryID->jobID), OpStatus map, and the post-install in-memory caches (ModelConfigLoader, UpgradeChecker) purely in-process. Reads never consulted PostgreSQL or NATS even though writes already published to PG. Symptoms: - A user installing a model on replica A saw the operation card flicker in and out as the load balancer alternated. - The Models page re-fetched the whole gallery on every flicker because useEffect([operations.length]) re-fires when the count changes. - A chat completion that landed on replica B after the install completed on replica A failed to find the new model — B's ModelConfigLoader was still the old one because nothing told it to reload from disk. - The UpgradeChecker 6-hour cache stayed stale on peer replicas after a backend upgrade, so /api/backends/upgrades kept surfacing an upgrade that had already shipped. Mirror the jobs Dispatcher pattern for gallery ops: - OpCache learns SetMessagingClient/SetGalleryStore + a Start(ctx) that hydrates from PostgreSQL and subscribes to gallery.opcache.{start,end}. Set/SetBackend now upsert cache_key + is_backend_op on the gallery_ operations row and broadcast OpCacheEvent so peers merge it in. The hydrate path uses a new GalleryStore.ListActive() (status in {pending, downloading, processing} and updated within 30 min). - GalleryService.SubscribeBroadcasts wires a SubjectGalleryProgress- Wildcard subscriber that calls a new lock-light mergeStatus into the local statuses map, plus a SubjectGalleryCancelWildcard subscriber that runs the locally-registered cancel func. Hydrate() restores active rows from PostgreSQL on startup so a freshly-started replica is not observably empty mid-install. CancelOperation tolerates the cancel func living on a different replica and publishes anyway. - modelHandler and backendHandler publish on the new SubjectCacheInvalidateModels / SubjectCacheInvalidateBackends after a successful install/delete/upgrade. SubscribeBroadcasts wires peers to refresh: OnModelsChanged (re-runs LoadModelConfigsFromPath) and OnBackendOpCompleted (re-triggers UpgradeChecker). The originating replica reloads inline so it never enters the broadcast handler. - OpStatus.Error (an error interface) flat-marshalled to "{}" over JSON, so a failed install replicated to a peer arrived with a nil error and the UI's failure banner never appeared. Add MarshalJSON/UnmarshalJSON via an opStatusWire shim that round-trips Error as a string. - UpdateStatus and CancelOperation now drop the mutex before publishing to NATS or persisting to PostgreSQL. The wildcard subscriber's mergeStatus loops back into the same service on the publishing replica and would deadlock otherwise; this also prevents future PG round-trips from stalling concurrent readers on every progress tick. Tests cover the OpStatus error round-trip, OpCache propagation through a shared in-memory bus, OpCache PostgreSQL hydration (active-only), GalleryService progress + cancel broadcast, Nodes preservation across a peer's bare progress tick, GalleryService hydration from PG, and the two cache-invalidation broadcasts (models + backends). 44 specs total in galleryop; routes/operations specs and jobs/agents suites still pass. Assisted-by: claude-code:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-25 17:28:14 +02:00
LocalAI [bot]	b02e3ffe61	feat(stablediffusion-ggml): LTX-2 support + LTX-2.3 GGUF gallery entries (#9980 ) stable-diffusion.cpp gained LTX-2 video generation, which requires an audio VAE and an embeddings_connectors safetensors in addition to the usual diffusion model, VAE, and LLM text encoder. The pinned commit exposes audio_vae_path and embeddings_connectors_path on sd_ctx_params_t; wire both through the option parser so gallery entries can point at the LTX-specific assets. Ship six LTX-2.3 GGUF gallery entries (dev + distilled, UD-Q4_K_M / Q4_K_M / Q8_0 each) backed by a new ltx-ggml.yaml template that defaults to euler / cfg_scale 6.0 / vae_decode_only:false / diffusion_flash_attn / offload_params_to_cpu — matching the upstream LTX-2 CLI recipe. Each entry pulls the model GGUF plus the QAT gemma-3-12b-it text encoder, video VAE, audio VAE, and embeddings connectors needed for T2V / I2V / FLF2V. Assisted-by: Claude:claude-opus-4-7 [Claude-Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-25 13:00:28 +02:00
LocalAI [bot]	a891eedd08	fix(distributed): persist per-model load info so reconciler survives frontend restart (#9981 ) * feat(distributed): add per-model ModelLoadInfo persistence Adds a dedicated ModelLoadInfo table keyed by model name, decoupled from the per-replica NodeModel rows. The reconciler can now recover model load metadata after every NodeModel row has been removed (worker death, eviction, MarkOffline reaping, frontend restart with stale heartbeats), which is the read side of Bug-1 from the distributed mode bug hunt. Registry exposes: - UpsertModelLoadInfo: ON CONFLICT (model_name) update; last-write-wins, matching the existing per-replica blob semantics under concurrent multi-frontend dispatch. - GetModelLoadInfo: read from the new table first; fall back to the legacy NodeModel-blob scan for rows written before any frontend in the cluster ran an UpsertModelLoadInfo (rolling-upgrade transition). SetNodeModelLoadInfo (per-replica blob) is preserved for backward compatibility and per-replica diagnostics; the dispatch-path hook in the next commit calls both. The new table joins the existing nodes AutoMigrate set under the same schema-migration advisory lock. Refs: Bug-1, docs/superpowers/specs/2026-05-24-distributed-mode-bug-hunt-findings.md Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7[1m] * fix(distributed): persist per-model load info on dispatch scheduleAndLoad now writes the (backendType, ModelOptions blob) pair to the new ModelLoadInfo table in addition to the existing per-replica NodeModel.model_opts_blob field. The per-replica blob still works for the hot path; the per-model row outlives every NodeModel row going away, which is what unblocks the reconciler on the read side. Both writes are best-effort with warn-level logging on failure: a write miss here just means the reconciler may need a fresh inference request to repopulate, which is the pre-fix behavior. Concurrency: two frontends loading the same model at the same time both fire UpsertModelLoadInfo; ON CONFLICT (model_name) makes the row converge to whichever commits last. Matches the existing per-replica blob semantics. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7[1m] * test(distributed): cover load info persistence and Bug-1 recovery Adds Ginkgo specs that prove the persistence layer behaves correctly and that the reconciler actually recovers from the frontend-restart scenario that was failing in production: registry_test.go: - per-model row survives RemoveAllNodeModelReplicas (the bug repro) - ON CONFLICT (model_name) updates backend type + blob, last-write-wins - legacy NodeModel-blob fallback still works (rolling-upgrade transition) - GetModelLoadInfo returns ErrRecordNotFound when both sources are empty - UpsertModelLoadInfo rejects empty model names reconciler_test.go: - Bug-1 end-to-end: with min_replicas=2, no NodeModel rows, but a ModelLoadInfo row present, one reconcile tick fires two scheduler calls. Pre-fix this returned "no load info" and the scheduler never got called until a fresh inference request arrived. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7[1m] * docs(distributed): note restart-safe reconciler behavior Adds a bullet to the Replica Reconciler section explaining that per-model load metadata is persisted across frontend restarts via the new model_load_infos PostgreSQL table, so a rolling upgrade no longer needs a fresh inference request per model before the reconciler can replace dead replicas. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7[1m] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-25 13:00:06 +02:00
LocalAI [bot]	06e777b75e	feat(distributed): gated X-LocalAI-Node response header (middleware + wrapper) (#9976 ) * feat(distributed): add per-request node ID context holder Introduce pkg/distributedhdr, a leaf package carrying a per-request atomic.Value holder for the picked worker node ID from the SmartRouter (core/services/nodes) up to the HTTP response writer wrapper (core/http/middleware). Avoids the import cycle that a shared key in either consumer would create. Exposes NewHolder, WithHolder, Holder, Stamp, Load, Inherit. The holder is atomic.Value so cross-goroutine publish from the router to the response writer wrapper is race-clean. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> feat(distributed): add ExposeNodeHeader middleware + response writer wrapper New ApplicationConfig.ExposeNodeHeader bool + --expose-node-header CLI flag / LOCALAI_EXPOSE_NODE_HEADER env var (default off; the node ID reveals internal topology and is opt-in). The middleware creates a per-request atomic.Value holder, attaches it to c.Request().Context() via distributedhdr.WithHolder, and wraps c.Response().Writer with a custom http.ResponseWriter that sets the X-LocalAI-Node header on first Write / WriteHeader / Flush by reading the holder. Implements http.Flusher, http.Hijacker, Unwrap so it composes cleanly with Echo and http.NewResponseController. request.go propagates the holder onto derived contexts via distributedhdr.Inherit so the holder survives the correlation-ID context replacement. Unit + race-clean concurrency + integration specs. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> feat(distributed): stamp node ID in router and wire middleware to inference routes ModelRouterAdapter.Route stamps the picked node ID into the per-request holder via distributedhdr.Stamp(ctx, result.Node.ID) right after replica selection. Wire ExposeNodeHeader middleware to: - OpenAI chat/completion/embeddings + audio transcriptions/speech + image generations/inpainting - Anthropic /v1/messages - Ollama /api/chat, /api/generate, /api/embed, /api/embeddings - Jina /v1/rerank - LocalAI /v1/vad The middleware's wrapper reads the holder on first byte and sets the X-LocalAI-Node response header before delegating to the underlying writer. Per-request scope means no race under concurrent multi-replica routing. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(distributed): thread request context through backend Load + cover ctx propagation Five non-OpenAI backend helpers were silently using app.Context instead of the request context for the gRPC backend call: transcription, TTS, image generation, rerank, VAD. Effect: distributedhdr.Stamp in the router callback was a silent no-op for these paths, AND client cancellation didn't propagate to in-flight inference. Thread c.Request().Context() (or the equivalent input.Context after the request middleware has installed the correlation-ID derived context) through each helper and into ModelOptions via model.WithContext(ctx). ImageGeneration's signature gains a leading ctx parameter; in-tree callers (openai image, openai inpainting, openai inpainting_test) are updated to match. ModelEmbedding gains a leading ctx parameter for the same reason; the openai and ollama embedding handlers pass the request context through. chat_stream_workers.go defers the initial role=assistant chunk emission until the first token callback so the wrapper's lazy X-LocalAI-Node lookup against the loader runs AFTER ml.Load has stamped the per-modelID node ID; semantically identical for clients (role still arrives before any text). Regression test core/backend/ctx_propagation_test.go pins ctx propagation for all five helpers. Docs updated to enumerate the full endpoint coverage of the --expose-node-header flag. Assisted-by: Claude:claude-opus-4-7[1m] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-25 10:51:48 +02:00
Richard Palethorpe	90ea327178	fix(intel): VRAM detection (#9944 ) * fix(gpu-detect): clinfo --json fallback for Intel discrete VRAM ghw returns 0 VRAM for any i915-driven Intel GPU because the kernel driver doesn't expose VRAM through the sysfs paths ghw checks (no mem_info_vram_total — that's an amdgpu interface). xpu-smi, the canonical Intel tool, isn't in the oneAPI base image (it lives in a separate xpumanager package). The capability gate added in `19c92c70` ("default to CPU if there is less than 4GB of GPU available") then demotes the host to CPU even on a 16 GB Arc A770. clinfo ships with the OpenCL ICD loader and is present in the oneAPI base image, so plug it in as the last-resort Intel VRAM source: xpu-smi -> intel_gpu_top -> clinfo --json The parser drops UMA devices via HOST_UNIFIED_MEMORY=true so an iGPU sibling can't double-count system RAM, and dedups by PCI BDF when multiple ICDs enumerate the same physical device (POCL caps reported GLOBAL_MEM_SIZE at 4 GiB; the largest non-capped value wins). Subprocess is wrapped in a 2s timeout and memoised with sync.OnceValue — GPU hardware is static for the process lifetime. The Intel branch also short-circuits when ghw saw no Intel vendor, so NVIDIA-only hosts don't pay the spawn cost. Verified end-to-end on Intel Arc A770: ghw -> 0, clinfo path reports 16,225,243,136 bytes (15.11 GiB), capability gate now passes naturally without LOCALAI_FORCE_META_BACKEND_CAPABILITY=intel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Richard Palethorpe <io@richiejp.com> * feat(gpu-detect): live VRAM usage from DRM fdinfo The clinfo fallback reports total VRAM correctly but leaves UsedVRAM at 0 because OpenCL has no portable live-memory property — the UI ends up showing 0% utilisation even when llama-cpp is actually holding gigabytes in device memory. Fill that gap with the standardised Linux DRM fdinfo interface (Documentation/gpu/drm-usage-stats.rst, kernel ≥5.19). Walking /proc/<pid>/fdinfo for any fd that points at /dev/dri/render* yields drm-total-<region> / drm-resident-<region> keys; aggregate per render-node, resolve the render node to a PCI BDF via /sys/class/drm/<name>/device, and merge the result into the matching GPUMemoryInfo by BDF. Region naming is driver-defined — i915 uses "local0" for device-local VRAM, amdgpu and xe use "vram0" — so a prefix-match on local/vram covers all three DRM drivers that LocalAI cares about. system/gtt/ stolen regions are deliberately excluded since they're host RAM mirrors and would double-count against system RAM. GPUMemoryInfo gains an optional BDF field (`bdf,omitempty` in JSON) so future vendor-specific detectors can plug into the same matcher. Empty BDF skips the merge — non-PCI devices and detection paths that don't surface PCI location keep their existing behaviour. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 09:29:00 +02:00
Richard Palethorpe	6a80e23733	feat(middleware): Model routing, PII filtering, Cloud model proxies (#9802 ) Add a routing middleware stack and a cloud-proxy backend. * cloud-proxy: a Go gRPC backend that forwards OpenAI- and Anthropic-shaped chat requests to upstream providers, with an optional translate mode (OpenAI request -> Anthropic /v1/messages -> OpenAI response) and full tool-calling support. * routing: admission control, content-aware model routing (embedding cache + classifier + rerank + Arch-Router score), PII detection/redaction (regex + NER) with streaming filter and OpenAI/Anthropic adapters, and a per-user/per-key billing recorder backed by GORM or in-memory storage. * middleware: UsageMiddleware records usage via the billing recorder, plus admission, route-model, usage-stamp and trace middlewares. * observability: BackendTrace ring buffer stores full request bodies (capped), MITM proxy emits structured trace events, and router classifier decisions surface at /api/router/decide. * gallery: Arch-Router-1.5B (Q4_K_M and Q8_0). * UI: cloud-proxy model-editor fields, classifier system-prompt and score-normalization config, and a Traces page rendering request bodies. Assisted-by: claude-code:claude-opus-4-7 [Read] [Edit] [Bash] Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-05-25 09:28:27 +02:00
LocalAI [bot]	1dcd1ae915	chore: ⬆️ Update ggml-org/llama.cpp to `549b9d84330c327e6791fa812a7d60c0cf63572e` (#9974 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> v4.3.1	2026-05-25 09:22:56 +02:00
LocalAI [bot]	acad78a95a	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `9f7ba245ab41e118f03aa8dd5134d18a81159d02` (#9973 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-25 00:05:29 +02:00
LocalAI [bot]	c94d1e1f5b	chore: ⬆️ Update antirez/ds4 to `f91c12b50a1448527c435c028bfc70d1b00f6c33` (#9975 ) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-25 00:05:15 +02:00
Copilot	270c256409	Fix kokoros backend build break from Backend trait drift (#9972 ) * Initial plan * fix(kokoros): implement missing AudioToAudioStream trait stubs Agent-Logs-Url: https://github.com/mudler/LocalAI/sessions/e3c6b042-f055-4df9-a05e-e2d8434ee58b Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-24 22:39:15 +02:00
Ettore Di Giacinto	1a30020a82	ci(backend-signing): set COSIGN_EXPERIMENTAL=1 for oci-1-1 referrers mode cosign v2.4.1 still gates --registry-referrers-mode=oci-1-1 behind the experimental flag, so the first signing run after the backend-signing merge failed with "you must set COSIGN_EXPERIMENTAL=1". Set it at the job env level so both the quay and dockerhub cosign steps inherit it, and note the requirement in .agents/backend-signing.md so a future cosign bump can drop the flag. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] v4.3.0	2026-05-24 08:21:05 +00:00
LocalAI [bot]	8bbe89a537	fix(distributed): route per request across loaded replicas + cache probeHealth (#9968 ) * refactor(distributed): extract PickBestReplica from FindAndLockNodeWithModel Lifts the replica-selection policy (in_flight ASC, last_used ASC, available_vram DESC) out of the SQL ORDER BY into a pure Go function in the new replicapicker.go. The SQL clause keeps its FOR UPDATE atomicity and remains the production path used by SmartRouter; PickBestReplica is the canonical implementation that the future per-frontend rotating replica cache (TODO referenced from pkg/model) will call against an in-memory snapshot without paying a DB round-trip per inference. A new registry_test mirror spec seeds a multi-tier scenario and asserts both layers pick the same replica, so any future tweak to either side fails the test until the other side is updated. No behavior change. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] * fix(distributed): route per inference request and cache probeHealth Two related fixes that together restore load balancing across loaded replicas of the same model. 1. ModelLoader.Load and LoadModel bypass the local Model cache when modelRouter is set. The cached Model wraps an InFlightTrackingClient bound to a single (nodeID, replicaIndex) — reusing it pinned every subsequent request to whichever node won the very first pick, so FindAndLockNodeWithModel's round-robin never got a chance to run even after the reconciler scaled the model out to a second node. In distributed mode SmartRouter.Route now runs per request, and PickBestReplica picks the least-loaded replica each time. SmartRouter has its own coalescing (advisory DB lock for first-time loads + singleflight on backend.install RPC) so concurrent first requests for a not-yet-loaded model still produce a single worker side install. 2. SmartRouter.probeHealth memoizes successful gRPC HealthCheck results in a new probeCache (probe_cache.go) with a 30s TTL. With per-request routing every inference call hits probeHealth, and llama.cpp-style backends serialize HealthCheck behind active Predict — so a burst of incoming requests stalled on the probe to a node already mid-stream, tripping the 2s timeout and falling through to the install path. singleflight collapses N concurrent first-time probes for the same (node, addr) into one round-trip, failed probes invalidate the entry so the staleness-recovery path still triggers, and the TTL matches pkg/model/model.go's healthCheckTTL so the single-process and distributed paths share a staleness budget. The background HealthMonitor still reaps actually-dead backends within ~45s. The bypass introduces one short FindAndLockNodeWithModel transaction per inference. A TODO in pkg/model/loader.go documents the future per modelID rotating-replica cache that would reuse PickBestReplica against an in-memory snapshot and skip the DB round-trip for hot paths. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-24 08:15:27 +00:00
LocalAI [bot]	dcc5599f89	chore: ⬆️ Update leejet/stable-diffusion.cpp to `a397e03488cc27e1a42da646b82dfce9f50741c0` (#9965 ) ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-24 08:35:36 +02:00
LocalAI [bot]	a95f4e63e0	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `642c038ccdf3dd08e6d9ac6fdc3b1c311ebd8a02` (#9966 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-23 23:52:51 +02:00
LocalAI [bot]	dfd19a3f88	chore: ⬆️ Update ggml-org/llama.cpp to `c0c7e147e7efa6c5858754b47259ba4880f8a906` (#9963 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-23 23:52:36 +02:00
LocalAI [bot]	d7387c725c	feat(swagger): update swagger (#9962 ) Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-23 23:52:10 +02:00
LocalAI [bot]	63d84a5705	chore: ⬆️ Update antirez/ds4 to `444afce822057d87f14c4dec307dce24fd49b3ee` (#9964 ) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-23 23:51:53 +02:00
LocalAI [bot]	1198d10b58	fix(traces): cap backend trace Data to keep admin UI responsive (#9960 ) * fix(traces): cap backend trace Data field so the admin UI stays responsive The previous fix (#9946) capped API trace bodies but missed backend traces, which carry the same blast radius: - LLM backend traces store the full chat messages JSON, full response, and full streaming deltas. Every agent-pool reasoning step ships the full RAG-augmented history (50-500 KiB per trace, often 100+ traces queued). - TTS / audio_transform / transcript traces embed a 30s audio snippet as base64, around 1.3 MiB per trace. Both blow the /api/backend-traces JSON past tens of MiB. The admin Traces page then keeps re-downloading and re-parsing the buffer faster than the 5s auto-refresh and stays in the loading state forever, the same symptom the API-side fix addressed. Apply two complementary caps, both honoring LOCALAI_TRACING_MAX_BODY_BYTES: Option A (safety net in core/trace): RecordBackendTrace walks the Data map recursively and replaces any string value larger than the cap with "<truncated: N bytes>". Catches anything a future producer forgets. Option B (head-preserving at the producer): - core/backend/llm.go: TruncateToBytes on messages, response, and chat_deltas content/reasoning_content so the leading content stays readable in the UI. - core/trace/audio_snippet.go: omit audio_wav_base64 when the encoded blob would exceed the cap (truncated base64 is undecodable). The quality metrics still ship and the UI's WaveformPlayer simply skips when the field is absent. TruncateToBytes is bounded to <= maxBytes so Option A leaves the producer's head-preserving output alone instead of replacing it with the bare marker. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 * fix(react-ui): expose tracing_max_body_bytes in Settings and Traces panels The setting was already plumbed through env (LOCALAI_TRACING_MAX_BODY_BYTES), CLI flag, and the runtime_settings.json GET/PUT schema, but neither the main Settings page nor the inline Traces panel offered an input for it. Admins hitting the "Traces UI stuck loading" symptom had to know to set an env var or PUT raw JSON to /api/settings to dial the cap. Add a "Max Body Bytes" row next to "Max Items" in both places. Same input type, same disabled-when-tracing-off semantics, placeholder shows the 65536 default so users see what they're inheriting. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 * test(react-ui): disambiguate Max Items locator after adding Max Body Bytes The Tracing settings panel now has two number inputs. The previous spec matched 'input[type="number"]' which became ambiguous and triggered a Playwright strict-mode violation in CI. Switch to getByPlaceholder('100') for Max Items and add a parallel spec for the new Max Body Bytes field using getByPlaceholder('65536'). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-23 14:50:40 +02:00
LocalAI [bot]	a0f3e26245	fix(distributed): make admin backend installs resilient and observable (#9958 ) * feat(distributed): add configurable NATS backend install/upgrade timeouts Adds BackendInstallTimeout and BackendUpgradeTimeout to DistributedConfig with 15m defaults, following the existing MCPToolTimeout / WorkerWaitTimeout pattern. These will replace the hardcoded literals in RemoteUnloaderAdapter so admin-driven backend installs across the cluster survive long OCI image pulls that previously timed out at 3m. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * style(distributed): gofmt alignment after timeout fields Re-aligns the Validate() negative-duration map and the Default* const block so the new BackendInstall/UpgradeTimeout entries do not leave the surrounding columns mis-padded. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(cli): surface LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT and _UPGRADE_TIMEOUT Parses the two new env vars on the run CLI and threads them through the existing AppOption builder so DistributedConfig picks them up. Invalid duration strings now fail loudly at startup rather than silently falling back to the default. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): inject NATS install/upgrade timeouts into RemoteUnloaderAdapter Removes the hardcoded 3m / 15m literals from RemoteUnloaderAdapter and threads in DistributedConfig.BackendInstallTimeoutOrDefault() and BackendUpgradeTimeoutOrDefault() at construction. Install now defaults to 15m (was 3m); cold OCI image pulls on Jetson Wi-Fi routinely blew past the old ceiling. Scripted messaging client captures the timeout so tests can assert the configured value actually reaches the NATS request. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): introduce galleryop.ErrWorkerStillInstalling sentinel When the NATS request-reply for backend.install (or .upgrade) times out the worker is almost always still pulling the OCI image. Wrap the timeout in a typed sentinel so the manager above can distinguish "worker hung" from "worker still working" and leave the pending_backend_ops row in place for the reconciler to confirm via backend.list. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): treat NATS install timeout as in-progress, not failure When a worker times out replying to backend.install but the install is still running on the worker, enqueueAndDrainBackendOp now reports a running_on_worker status and pushes NextRetryAt out by the install timeout so the reconciler does not immediately re-fire another install while the worker is still pulling the image. The pending_backend_ops row stays in place for the next reconciler pass to confirm via backend.list. InstallBackend wraps the result in galleryop.ErrWorkerStillInstalling so callers can branch (galleryop renders yellow in-progress instead of red error). UpgradeBackend uses the same wrap. Adds RemoteUnloaderAdapter.InstallTimeout() so the manager can push NextRetryAt by the configured timeout without reaching into a private field, and NodeRegistry.RecordPendingBackendOpInFlight as the soft cousin of RecordPendingBackendOpFailure. Also includes incidental gofmt-driven struct-field alignment in registry.go on lines unrelated to the change (touched files are re-formatted to canonical form per project policy). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(distributed): don't increment Attempts on in-flight install timeout An in-flight timeout (worker still pulling the OCI image) is not a failed attempt, it's a delayed one. Incrementing Attempts let genuinely-progressing slow installs (e.g. 30 GB CUDA images on Wi-Fi) trip the reconciler's maxPendingBackendOpAttempts cap and dead-letter the queue row while the worker was still legitimately working. RecordPendingBackendOpInFlight now only updates LastError and NextRetryAt. Also documents "running_on_worker" in the NodeOpStatus.Status enum comment so Task 6 implementers see the full surface. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(galleryop): surface ErrWorkerStillInstalling as non-error OpStatus When the distributed backend manager returns an error that wraps ErrWorkerStillInstalling, backendHandler now completes the op with a "still installing in background" message rather than marking it as a red failure. Admin UI sees a yellow in-progress state; reconciler confirms completion on its next pass. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(distributed): end-to-end install-timeout-then-reconcile Wires Task 1-6 end-to-end so any seam mismatch surfaces in CI rather than during a real cluster install. NATS times out, the queue row stays alive with running_on_worker status, the worker eventually reports the backend installed via backend.list, the manager surfaces it via ListBackends. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): document LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT / _UPGRADE_TIMEOUT Add the two new operator-tunable env vars to the Frontend Configuration table in the distributed-mode docs. Explains the 15m default, when to raise it (slow links pulling multi-GB OCI images), and the new "still installing in background" admin-UI state when the round-trip times out but the worker is still working. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): clear pending install rows when backend.list confirms DistributedBackendManager.ListBackends now proactively clears pending_backend_ops install rows whose (nodeID, backend) is reported installed by backend.list. Operator UI updates immediately instead of waiting up to installTimeout (default 15m) for the next reconciler tick after NextRetryAt. Only install rows are cleared; upgrade and delete intents are not satisfied by presence in backend.list and continue to drain through their normal reconciler paths. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(messaging): add BackendInstallProgressEvent wire type and subject New NATS subject nodes.<nodeID>.backend.install.<opID>.progress lets the worker publish transient progress events (file, current/total bytes, percentage, phase) while a long-running install pulls its OCI image. BackendInstallRequest gains an optional OpID field so the worker knows which subject to publish on. Transient pub/sub (not JetStream): the install reply remains ground truth for success/failure; dropped progress events are tolerable. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * style(messaging): drop em-dash from BackendInstallProgress test comment Per project convention (no em-dashes anywhere). Comment substance is unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): worker publishes debounced install progress over NATS When BackendInstallRequest.OpID is set, the worker's backend.install handler wires a debounced publisher (250ms window) into the gallery download callback. Each tick becomes a BackendInstallProgressEvent on nodes.<nodeID>.backend.install.<opID>.progress; the publisher always emits a final event on Flush so the UI sees the terminal percentage. Old masters that do not set OpID continue to run silent installs: no behavior change for them. Lock ordering: the publisher releases its mutex before calling messaging.Publish so a slow network never stalls the install loop. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): RemoteUnloaderAdapter subscribes to install progress InstallBackend gains opID + onProgress parameters. When both are set, the adapter subscribes to nodes.<nodeID>.backend.install.<opID>.progress BEFORE publishing the install request, decodes each message into the caller's onProgress callback in a goroutine (so a slow callback never stalls the NATS reader thread), and unsubscribes after RequestJSON returns. When onProgress is nil OR opID is empty (the reconciler retry path), subscription is skipped entirely - silent installs cost nothing extra. Subscribe failure is logged at Warn and the install proceeds without progress streaming; the NATS round-trip still owns terminal status. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): forward backend install progress into galleryop OpStatus DistributedBackendManager.InstallBackend now passes the gallery op ID and a progress bridge into the adapter call. Each BackendInstallProgressEvent from the worker becomes a galleryop.ProgressCallback tick - which the existing backendHandler already turns into OpStatus.UpdateStatus, so the admin UI/SSE polling sees per-byte progress for distributed installs without any UI-side change. UpgradeBackend is intentionally left silent for now: its wire request (BackendUpgradeRequest) does not carry OpID, and rolling-update fallback is the rarer path. Will be picked up in a follow-up if the worker upgrade path also gets a progress channel. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(distributed): InstallBackend tolerates silent (pre-Phase-2) workers A worker on pre-Phase-2 code never publishes progress events. The new master subscribes optimistically; this spec pins that a silent worker still produces a green install with no progressCb ticks. The install reply is the source of truth for terminal state; the progress stream is a best-effort UX enrichment. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): document install progress streaming Note the new nodes.<nodeID>.backend.install.<opID>.progress subject and the silent-worker compatibility behavior so operators know to expect real-time progress and what happens on a mixed-version cluster. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): note progress-event ordering trade-off in InstallBackend Document near the goroutine dispatch why ordering at the consumer is best-effort, why it rarely matters in practice (worker debounce >> goroutine jitter), and what a future hardening pass would look like (Seq field + stale-by-seq drop). Stops the next reader from accidentally "fixing" the goroutine pool away. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(galleryop): add NodeProgress + OpStatus.Nodes for per-node breakdown Adds the data model the UI needs to render an expandable per-node breakdown of a fanned-out backend install. NodeProgress carries node identity (ID + name), per-node status (queued / running_on_worker / success / error / downloading), the current file + bytes + percentage from the Phase 2 progress stream, and any per-node error. OpStatus.Nodes is the slice the /api/operations handler will surface in a follow-up. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(galleryop): UpdateNodeProgress merges per-node ticks by NodeID GalleryService.UpdateNodeProgress(opID, nodeID, np) merges a NodeProgress into OpStatus.Nodes (keyed by NodeID, no duplicates) and mirrors the latest tick into the aggregate Progress / FileName / DownloadedFileSize / TotalFileSize fields so the legacy single-bar OperationsBar view keeps working unchanged alongside the new per-node breakdown. Concurrent-safe via the existing g.Mutex. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(distributed): write per-node OpStatus entries during install fan-out DistributedBackendManager now accepts a nodeProgressSink and feeds it two streams: 1. enqueueAndDrainBackendOp emits a per-node terminal entry on each status it appends to BackendOpResult (queued, success, error, running_on_worker). The opID is threaded through the function so the sink gets the right gallery op identity. 2. The install apply closure fans each BackendInstallProgressEvent into the sink as a downloading entry, alongside the legacy progressCb path so the aggregate single-bar view stays correct. Production wiring passes the GalleryService (which implements UpdateNodeProgress via Task 2) as the sink. Single-node tests pass nil. DeleteBackend and UpgradeBackend pass an empty opID so the sink path no-ops for ops that aren't gallery-tracked the same way as Install. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(operations): expose per-node breakdown on /api/operations When an operation's OpStatus has Nodes entries (populated by the Phase 4 progress sink wiring), surface them as a "nodes" array on the /api/operations response, sorted by node_name for stable rendering. Backward compatible: legacy clients ignore the field; ops without any node entries (single-node mode, model installs) omit the array entirely thanks to the empty-slice guard. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(ui): per-node breakdown in OperationsBar When an install op fans out to more than one worker, the operations bar now shows a "N nodes" chevron that expands into a per-node list. Each row carries the node's status (color-coded pill), the current file being downloaded, byte counts, percentage, and a thin per-node progress bar. Yellow "Worker busy" pill marks running_on_worker status with a tooltip explaining the NATS round-trip timed out but the worker is still installing in the background. Backward compatible: ops without a nodes field (legacy or single-node mode) render as before. State for expand/collapse is local to the component, keyed by jobID/id - reload starts collapsed. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): document per-node breakdown in the operations bar Adds a short subsection covering the expandable "N nodes" chevron in the OperationsBar admin UI, the meaning of each status pill, and how it relates to the /api/operations nodes array. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(galleryop): UpdateStatus preserves Nodes when caller sends none Real-world bug surfaced by the Phase 4 multi-worker smoke test: the nodes[] array in /api/operations flickered between a single node at a time on a 2-worker install. Root cause: the Phase 2 progress bridge also calls the legacy progressCb -> UpdateStatus(&OpStatus{...}) on every tick. UpdateStatus then overwrote the entire status pointer, wiping the Nodes slice that UpdateNodeProgress had just merged in. Fix: in UpdateStatus, if the incoming op has an empty Nodes slice, carry forward the previous status's Nodes before storing. Callers that explicitly populate Nodes still win (their slice replaces the prior one, no merge across the two code paths). Two regression specs added pinning both directions of the contract. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * docs(distributed): strip implementation details from user-facing docs Trim the new install/upgrade timeout rows and the install-progress sections to focus on what the operator sees and tunes. Drops: - the NATS subject names and pub/sub mechanics - "round-trip" / reconciler / backend.list jargon - /api/operations polling cadence - "pre-2026-05-22" version references Reframes the breakdown text around the admin UI (Operations Bar, chevron, status pills, "Worker busy" tooltip). Implementation context lives in the agent notes and code comments. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(config): move DistributedConfig.Validate flag names to constants The negative-duration check map was a wall of literal kebab-case strings that had to stay in sync with the kong-derived CLI flag names manually. Move them to a Flag* const block alongside the existing Default* block so a rename of either the Go field or the CLI naming convention forces a compile error rather than silent drift. Sole consumer today is Validate; the constants are exported so future operator-facing surfaces (e.g. error messages on other validation paths) can reference them by name instead of repeating the literals. Tests pin both the literal values (so a future "let's just rename this" doesn't accidentally regress the CLI flag) and the negative- duration error message for the new BackendInstall / BackendUpgrade fields. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(distributed): extract NodeStatus and Phase enums to constants Sweep for the same literal-string-as-identifier pattern called out on the Validate flag names: the per-node install status enum ("queued" \| "downloading" \| "running_on_worker" \| "success" \| "error") appeared as raw literals across managers_distributed.go (10+ sites, including 3 separate `n.Status == "running_on_worker"` checks), operation.go, and the test suite. Same shape for the Phase enum ("resolving" \| "downloading" \| "extracting" \| "starting") in the worker-side progress publisher. Promote both to exported const blocks: - galleryop.NodeStatus{Queued,Downloading,RunningOnWorker,Success,Error} shared between galleryop.NodeProgress.Status (the wire field) and nodes.NodeOpStatus.Status (the in-process per-node summary) - messaging.Phase{Resolving,Downloading,Extracting,Starting} shared between the worker publisher and any future consumer that needs to switch on phase Tests pin both the literal values (so a future "let's just rename" doesn't silently change the JSON wire) and use the constants in setup (so the producer side stays drift-protected). Wire-format assertions on the /api/operations JSON output keep their literals deliberately, so the constant value can never silently diverge from what the UI receives. Out of scope for this PR (separate cleanup): the finetune and quantization job-status enums have the same anti-pattern with 14+ literal sites each, but predate this PR's work. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-23 12:35:44 +02:00
LocalAI [bot]	e4cc1f11f3	chore: ⬆️ Update ggml-org/llama.cpp to `1acee6bf8939948f9bcbf4b14034e4b475f06069` (#9952 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-23 08:38:29 +02:00
LocalAI [bot]	6ed269d0b9	chore: ⬆️ Update ggml-org/whisper.cpp to `0ccd896f5b882628e1c077f9769735ef4ce52860` (#9954 ) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-23 08:37:26 +02:00
LocalAI [bot]	5756fb046d	chore: ⬆️ Update leejet/stable-diffusion.cpp to `0baf721215f45335a5df8caf0ecb34e870c956e7` (#9955 ) ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-23 08:37:10 +02:00
Copilot	7980629bc5	Fix backend manifest merge signing on current cosign releases (#9957 ) * Initial plan * fix: remove deprecated cosign bundle flag from backend merge workflow Agent-Logs-Url: https://github.com/mudler/LocalAI/sessions/4207dabc-14ec-4655-9594-487338977fcf Co-authored-by: mudler <2420543+mudler@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-23 00:20:28 +02:00
LocalAI [bot]	d0a59be9de	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `b3d39cff8bffbd67296d6badd4076a1486a0715c` (#9953 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-22 23:58:48 +02:00
LocalAI [bot]	5cda4f1ccf	fix(L4T13 backends): switch vllm/sglang/vllm-omni to PyPI aarch64+cu130 wheels (#9950 ) * fix(vllm): switch L4T13 backend to PyPI aarch64+cu130 wheels The L4T13 vllm backend pulled torch / torchvision / torchaudio / vllm from pypi.jetson-ai-lab.io's sbsa/cu130 mirror via [tool.uv.sources] with no version pins. That mirror started shipping torch 2.11.0 next to a vllm-0.20.0+cu130 wheel that was still compiled against torch 2.10's c10 ABI, so uv landed on the mismatched pair and vllm crashed at import: ImportError: vllm/_C.abi3.so: undefined symbol: _ZN3c1013MessageLoggerC1EPKciib (c10::MessageLogger's constructor signature changed between torch 2.10 and 2.11; the vllm wheel referenced the 2.10 form, the installed libc10.so exported only the 2.11 form.) Since torch 2.11 (April 2026) PyPI publishes its own aarch64 + cu130 manylinux wheels, and vllm 0.20.0 ships an aarch64 wheel whose Requires- Dist locks torch==2.11.0 / torchvision==0.26.0 / torchaudio==2.11.0. That makes uv's resolver produce an ABI-consistent set automatically, so the mirror and the [tool.uv.sources] pinning are no longer needed. flash-attn is dropped from the dep list: PyPI has no aarch64 wheel, but vLLM 0.20+ already bundles its own vllm_flash_attn (fa2 + fa3) inside the main wheel, so the Dao-AILab package isn't required at runtime. Reference: https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/ Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] [WebFetch] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(vllm): retire l4t13 pyproject.toml in favor of requirements-.txt pyproject.toml only existed because uv pip install -r requirements.txt doesn't honor [tool.uv.sources]. The previous commit dropped [tool.uv. sources] (PyPI now serves the aarch64 + cu130 wheels directly), so the file no longer carries any logic the requirements-.txt path can't. Replace with the same two-file pattern every other build profile uses: - requirements-l4t13.txt (accelerate / torch / transformers / bitsandbytes - matches cublas13's split) - requirements-l4t13-after.txt (vllm; runs after the base resolve so the cu130 torch wheel lands first) install.sh's whole l4t13 elif branch goes away; libbackend.sh's installRequirements already handles the requirements-install.txt build- deps pass, the C_INCLUDE_PATH export for PORTABLE_PYTHON, and the runProtogen call, so falling through to the standard else: branch produces identical install behavior with less surface area. No functional change at install time - same wheels, same order. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(sglang,vllm-omni): switch L4T13 backends to PyPI aarch64+cu130 wheels Same root cause and same fix as the vllm backend in the previous commits: the L4T13 sglang and vllm-omni backends both pulled their accelerator stack from pypi.jetson-ai-lab.io's sbsa/cu130 mirror with no version pins, so they would silently land on the same torch 2.11 vs cu130-built wheel ABI mismatch the moment the mirror published an out-of-sync pair. sglang ------ - Drop pyproject.toml + [tool.uv.sources]. The historical comment said the [all] extra was unsafe on aarch64 because of decord, but sglang 0.5.x now uses `decord2` on aarch64/arm/armv7l (which ships cp312 aarch64 wheels), so we can match cublas13's sglang[all]>=0.5.11 pin and stop being capped at the 0.5.1.post2 the L4T mirror shipped. That unblocks Gemma 4 / MTP recipes on Jetson Thor. - New requirements-l4t13.txt mirrors the cublas13 split (accelerate / torch / torchvision / torchaudio / transformers), requirements-l4t13- after.txt carries sglang[all]>=0.5.11. - install.sh's l4t13 elif branch goes away; falls through to the standard installRequirements path. vllm-omni --------- - requirements-l4t13.txt drops --extra-index-url to jetson-ai-lab and drops flash-attn (PyPI has no aarch64 wheel, vLLM 0.20+ bundles its own vllm_flash_attn fa2 + fa3 internally). - install.sh's l4t13 vllm-install branch collapses into the cublas13 branch since both now just run `pip install vllm --torch-backend=auto` against PyPI. - --index-strategy=unsafe-best-match is dropped from the top-level l4t13 guard; without the L4T mirror in the picture it had no purpose. The from-source vllm-omni install on top still keeps its existing `sed -i '/^fa3-fwd[[:space:]]==/d' requirements/cuda.txt` workaround - fa3-fwd has no aarch64 wheel and no sdist, unrelated to flash-attn. Reference: https://pytorch.org/blog/vllm-and-pytorch-work-together-to-improve-the-developer-experience-on-aarch64/ Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] [WebFetch] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> fix(sglang): drop [all] extra on l4t13 - xatlas has no aarch64 wheel CI revealed that sglang[all]==0.5.12 transitively pulls xatlas via the [diffusion] sub-extra, and xatlas ships no aarch64 wheel. Its sdist depends on scikit_build_core without declaring it in build-system. requires, so under --no-build-isolation uv can't build it from source: × Failed to build `xatlas==0.0.11` ├─▶ The build backend returned an error ╰─▶ Call to `scikit_build_core.build.build_wheel` failed (exit status: 1) ModuleNotFoundError: No module named 'scikit_build_core' help: `xatlas` (v0.0.11) was included because `sglang[all]` (v0.5.12) depends on `xatlas` Upstream sglang explicitly gates st_attn and vsa on `platform_machine != aarch64` inside the same [diffusion] extra but forgot xatlas - same class of bug that bit the old decord pin. Use plain `sglang>=0.5.11` on l4t13. backend.py imports only base sglang.srt symbols (Engine, ServerArgs, FunctionCallParser, ReasoningParser); the [all] extras are optional accelerators not required at import time. cublas13 (x86_64) keeps [all] because xatlas has x86_64 wheels there. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-22 23:01:22 +02:00
LocalAI [bot]	c500461c69	feat(config): default prompt_cache_all to true (#9951 ) Upstream llama.cpp defaults `cache_prompt = true` (common/common.h), but `parse_options` in the grpc-server backend unconditionally forwards the proto `PromptCacheAll` field, so any model that didn't set `prompt_cache_all: true` in its YAML was getting `cache_prompt=false` — silently overriding llama.cpp's own default. With `kv_unified` and `cache_idle_slots` already on by default, this was the last piece preventing the per-request prompt cache from being usable out of the box. Make `PromptCacheAll` tristate (`*bool`), default it to `true` in `SetDefaults`, and dereference at the proto boundary. Users can still opt out with an explicit `prompt_cache_all: false`. Same pattern as `MMap`, `MMlock`, `Reranking`, etc. Co-authored-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-22 22:06:22 +02:00
LocalAI [bot]	834ecc36bf	fix(react-ui): unify backend-logs entry point for distributed mode (#9949 ) In distributed mode the local /api/backend-logs WebSocket has nothing behind it (inference runs on workers), so the "View backend logs" link in Traces (and the action in Manage when previously not hidden) dead- ended on /app/backend-logs/<modelId>. Manage worked around it by hiding the action; Traces still rendered the link. Make /app/backend-logs/:modelId the single, mode-aware entry point. A new BackendLogsRouter probes useDistributedMode and forks: - standalone: existing local WebSocket view (BackendLogsDetail). - distributed: DistributedBackendLogsResolver fans out to each node via nodesApi.getModels, filters by model_name, and routes: * 0 hits -> empty state with a link to the Nodes page. * 1 hit -> <Navigate replace> to /app/node-backend-logs/<nodeId>/<modelId>, preserving the ?from= deep-link timestamp. * N hits -> picker listing each hosting worker (node id, replica index, load state) so the operator can choose which worker's logs to view. Bare modelId in the redirect target intentionally aggregates that node's replicas via the worker's BackendLogStore, matching the existing per-node link pattern in Nodes.jsx. Revert the per-caller distributed checks now that routing is centralised: drop the hidden:distributedMode guard on Manage's Backend logs action, and remove the prop threading in Traces so the link is unconditional. Any future view that wants to link to backend logs uses the same URL and gets correct behaviour in both modes. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-22 22:00:08 +02:00
LocalAI [bot]	61bf34ea2f	fix(traces): cap captured body size to keep admin Traces UI responsive (#9946 ) The trace middleware buffered the full request and response bodies for every JSON exchange. With a chatty agent-pool RAG workload, /embeddings responses (large vector arrays) accumulated to tens of MB in the in-memory buffer; the admin Traces page would then download and parse 40+ MB on every load and on every 5s auto-refresh, locking the UI in a loading state. Add LOCALAI_TRACING_MAX_BODY_BYTES (default 64 KiB) that caps each captured body. The full payload still flows through to the real client; only the trace copy is bounded. Exchanges record body_truncated and original body_bytes so the dashboard can show that truncation happened. The cap is configurable via env, CLI, and runtime_settings.json. Also unblock recovery: the Traces page now keeps the Clear button enabled while loading, since "buffer too large to render" is exactly when the user needs to clear it. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-22 15:29:24 +02:00
LocalAI [bot]	0b2ae3c6ca	fix(openai): stream usage non-zero when tools are enabled (#9941 ) * chore: ignore local .worktrees directory Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(openai): stream usage non-zero when tools are enabled The streaming chat-completions worker for tool-bearing requests (processTools in core/http/endpoints/openai/chat.go) never forwarded the cumulative TokenUsage from ComputeChoices to the chunks it placed on the responses channel. The outer streaming loop's running usage tracker therefore stayed at the zero value, and the include_usage trailer reported {prompt_tokens:0, completion_tokens:0, total_tokens:0} whenever the request carried a `tools` array. Without tools, the alternative `process` path stamps Usage on every chunk, so that path was unaffected. Forward the final TokenUsage via a usage-only sentinel chunk (empty Choices, populated Usage) emitted right before close(responses). The outer loop's per-chunk Usage capture moves above the empty-Choices skip so the sentinel updates the tracker without ever reaching the wire, keeping the existing OpenAI spec contract (intermediate chunks carry no `usage` field, and the deferred-final-chunk helpers remain Usage-free per the regression test for issue #8546). Adds streamUsageFromTokenUsage, usageSentinelChunk, and applyChunkToUsage helpers with focused Ginkgo coverage plus a flow-level test that mirrors the outer-loop sequence. Fixes #9927 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4-7 [Claude Code] * refactor(openai): return final TokenUsage from stream workers Replace the usage-only sentinel SSE chunk introduced in the previous commit with a plain return value. The streaming workers process and processTools (now extracted as package-level processStream and processStreamWithTools) return (backend.TokenUsage, error); the outer ChatEndpoint loop reads the cumulative counts off the existing `ended` channel (now carrying streamWorkerResult{usage, err}) and builds the include_usage trailer from a normal Go value after the LOOP exits. This drops the empty-Choices "skip but capture Usage" rule from the outer loop and removes the usageSentinelChunk / applyChunkToUsage helpers entirely. The SSE responses channel is back to a single purpose: wire chunks only. processStream and processStreamWithTools move into chat_stream_workers.go so they can be exercised directly from tests. The chat_stream_usage_test.go suite now drives the workers with a mocked backend.ModelInferenceFunc and asserts on the returned TokenUsage. The regression coverage for issue #9927 is therefore behavioral: reverting the fix (discarding ComputeChoices' usage return) makes the assertions fail with concrete count mismatches. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4-7 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-22 10:13:41 +02:00
LocalAI [bot]	4735345105	chore: ⬆️ Update ggml-org/llama.cpp to `bb28c1fe246b72276ee1d00ce89306be7b865766` (#9934 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-22 09:49:33 +02:00
LocalAI [bot]	7384fd800b	chore: ⬆️ Update antirez/ds4 to `8d576642c39b9a2d782a80159ba84ef5a81c0b81` (#9932 ) ⬆️ Update antirez/ds4 Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-22 08:31:49 +02:00
LocalAI [bot]	6942713d85	chore: ⬆️ Update leejet/stable-diffusion.cpp to `3a8788cb7d74f185d6b18688e9563015524ecaf5` (#9933 ) ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-22 00:31:19 +02:00
LocalAI [bot]	0cf52c44d4	chore: ⬆️ Update ggml-org/whisper.cpp to `8443cf05e3fa8ce1b32348e1bcbcf8fc31f7f3ae` (#9929 ) ⬆️ Update ggml-org/whisper.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-21 23:24:01 +02:00
LocalAI [bot]	0d34cf7cbd	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `48a55f74e4c6e2aeda363dd386c1ac9170a0af71` (#9930 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-21 23:23:37 +02:00
LocalAI [bot]	f0cb02afb8	feat(usage): attribute Sources rows to user accounts in admin view (#9935 ) The merged feature (#9920) let admins see per-API-key and per-source totals but did not surface which user owned each key, and lumped every user's Web UI traffic into a single global Web UI row. This makes the admin Sources tab properly per-user attributable: - KeyTotal gains UserID + UserName, populated from the snapshot the usage middleware already records. The by_key roll-up now groups by (api_key_id, api_key_name, user_id, user_name). - New SourceTotals.ByUserSource roll-up groups (source, user_id, user_name) for sources without a key identity (web, legacy). Only populated on the admin path (includeLegacy=true); the non-admin endpoint stays unchanged for backwards compatibility. - SourcesTable accepts showUserColumn={isAdmin}; admin view renders a User column, makes the search match user name/id, and expands Web UI / legacy pseudo-rows from the global aggregate to one row per user using by_user_source. Refs: #9862 Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-21 23:23:06 +02:00
LocalAI [bot]	a39e025d64	fix(nodes): make per-node backend install async via gallery job queue (#9928 ) * feat(galleryop): add TargetNodeID to ManagementOp for single-node installs Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(galleryop): add NodeScopedKey helpers for per-node opcache rows Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(galleryop): use strings.Cut for NodeScopedKey parsing, reject empty nodeID Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(nodes): scope DistributedBackendManager.InstallBackend to single node via TargetNodeID Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(http): make /api/nodes/:id/backends/install async via gallery service job queue The handler previously called unloader.InstallBackend synchronously and blocked the browser for up to 3 minutes waiting on the NATS reply. It now enqueues a TargetNodeID-scoped ManagementOp on BackendGalleryChannel and returns HTTP 202 + jobID immediately, matching /api/backends/install/:id. The opcache key is built via NodeScopedKey(nodeID, backend) so concurrent installs of the same backend across different nodes do not stomp each other. galleryService/opcache/appConfig are threaded through RegisterNodeAdminRoutes for this. Assisted-by: Claude:opus-4-7 [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(http): log malformed backend_galleries override and stop test drain goroutine Assisted-by: Claude:opus-4-7 [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(api): expose nodeID for node-scoped backend ops in /api/operations Node-scoped backend installs land in opcache under "node:<nodeID>:<backend>" keys. Without splitting that prefix back out, the operations panel renders the full key as the display name and has no structured way to label which worker an install is targeting. Detect the prefix, surface nodeID as its own response field, and reduce the display name back to the bare backend slug. Bare (non-scoped) ops are left untouched so legacy installs do not gain a misleading empty nodeID. Assisted-by: Claude:opus-4-7 [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(react-ui): poll job status for node-targeted backend installs Assisted-by: Claude:opus-4-7 [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(react-ui): make NodeInstallPicker state updates pure and surface cancellations as errors Assisted-by: Claude:opus-4-7 [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(react-ui): clarify async semantics in handleInstallOnTarget Assisted-by: Claude:opus-4-7 [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(http): use statusUrl casing for node install response to match codebase precedent Assisted-by: Claude:opus-4-7 [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-21 22:25:53 +02:00
Ettore Di Giacinto	05e8e1e9f4	ci(images): publish chronologically-orderable master-<epoch>-<sha> tags The existing master push pipeline produces `master` (rolling) and `sha-<short>` tags. Neither is orderable by build time, so downstream GitOps that want to auto-bump to the newest master build (e.g. Flux ImagePolicy) can't pick the latest from the tag list — alphabetical sort over hex shas is effectively random, and the rolling `master` tag can't be referenced as an immutable bump target. Add a third tag of the form `master-<epoch>-<sha>` (Unix epoch in seconds + short sha), gated on default-branch pushes via metadata- action's `is_default_branch` predicate. The sha is retained for traceability; the epoch makes the tags numerically orderable, so a Flux ImagePolicy like filterTags: pattern: '^master-(?P<ts>[0-9]+)-[a-f0-9]+$' extract: '$ts' policy: numerical: order: asc will reliably bump to the newest master build. Applied to both image_build.yml (OCI labels stay consistent) and image_merge.yml (the actual tag publisher via buildx imagetools).	2026-05-21 17:18:30 +00:00

1 2 3 4 5 ...

6462 Commits