LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-05-16 20:52:08 -04:00

Author	SHA1	Message	Date
Ettore Di Giacinto	a6121e240e	docs: credit the LocalAI maintainers team Update README and docs to attribute maintenance to the LocalAI team (Ettore Di Giacinto and Richard Palethorpe) and drop the autonomous AI dev team section. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:claude-opus-4-7 [Edit] [Bash]	2026-05-02 23:37:04 +00:00
Ettore Di Giacinto	87cf736068	feat(react-ui): add multilingual (i18n) support (#9642 ) Adds end-to-end internationalization to the React UI with five seed languages (English, Italian, Spanish, German, Simplified Chinese) and a sidebar-footer language switcher next to the existing theme toggle. Library: react-i18next + i18next + i18next-http-backend + i18next-browser-languagedetector. The detector caches the user's choice in localStorage (key `localai-language`, mirroring the existing `localai-theme` convention) and updates the `<html lang>` attribute on change. fallbackLng is `en`, so any missing translation in another locale falls back transparently. Translation files live under `public/locales/<lng>/<ns>.json`. They ride along with the existing `//go:embed react-ui/dist/` directive, but the previous SPA route in core/http/app.go only exposed `/assets/` from the embedded React build. This commit generalizes the asset handler into a `serveReactSubdir(subdir)` helper and adds a matching `/locales/*` route so i18next-http-backend can fetch the JSONs at runtime. The http-backend `loadPath` is built via the existing `apiUrl()` helper so instances served under a sub-path (e.g. `<base href="/ui/">`) resolve correctly. Namespaces (13): common, nav, errors, auth, home, models, importModel, chat, agents, skills, collections, media, admin. Translated UI surfaces include the sidebar/header/footer chrome, login + account flows, the Home dashboard (incl. the manage-by-chat assistant CTA), the model gallery + import flow, the chat experience (Chat.jsx + ChatsMenu), agents/skills/collections list pages, the studio media tabs (Image, Video, TTS), and the admin page-headers (Settings incl. its section nav, Manage, Backends, Traces, Nodes, P2P, Users, Usage). Shared components (ConfirmDialog, Toast) take their default labels from the common namespace so callers don't need to pass strings explicitly. Tooling for incremental adoption is included: - `i18next-parser.config.js` + `npm run i18n:extract` to sweep `t()` keys into the JSON skeletons. - `scripts/translate-locales.mjs` (one-off helper) to bootstrap non-English locales from English source via OpenAI or Anthropic APIs, with --copy mode as a placeholder fallback. Idempotent; preserves existing translations unless --overwrite is passed. Larger config-driven pages (ModelEditor, Settings deep field forms, AgentChat/AgentCreate, SkillEdit, CollectionDetails, Talk, Sound, biometrics, FineTune/Quantize, Users modals, Nodes/P2P install pickers, BackendLogs, Traces deep filters, Explorer) intentionally keep their inner content untranslated for now — they fall back to English via fallbackLng so functionality is unaffected, and the extracted-strings pattern + the bootstrap script make follow-up extraction straightforward. The initial Suspense fallback at the root in main.jsx covers the first JSON fetch on cold load. A simple `.app-boot-spinner` styled in App.css provides a non-empty paint while the first namespace loads. Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit Write Agent] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-02 22:42:08 +02:00
LocalAI [bot]	1ad5b5907d	feat(swagger): update swagger (#9643 ) Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-02 22:33:47 +02:00
Russell Sim	18e039f305	fix(ci): fix AMDGPU_TARGETS empty-string bypass in hipblas builds (#9626 ) * fix(ci): fix AMDGPU_TARGETS empty-string bypass in hipblas builds `399c1dec` wired amdgpu-targets through the backend_build workflow_call interface, intending the input's default value to cover matrix entries that don't specify targets. However, GitHub Actions only applies a workflow_call input default when the caller omits the input entirely. When backend.yml passes `amdgpu-targets: ${{ matrix.amdgpu-targets }}` and the matrix entry has no amdgpu-targets key, the expression evaluates to an empty string, which is treated as an explicit value — bypassing the default. The result is Docker receiving AMDGPU_TARGETS="" which in turn causes Make's ?= default to be skipped (since the variable is already set in the environment, even to empty), and cmake gets -DAMDGPU_TARGETS= with no targets, so the HIP backend compiles for an indeterminate target rather than the intended GPU list. Fix this at two levels: 1. backend.yml: use a \|\| fallback in the expression so that an undefined matrix.amdgpu-targets never reaches the reusable workflow as an empty string. The target list is the canonical default and lives here. 2. backend_build.yml: remove the now-misleading default value from the input declaration. The default never fired due to the above bug, so keeping it implied a guarantee that didn't exist. 3. backend/cpp/llama-cpp/Makefile: add an explicit $(error ...) guard after the ?= assignment so that if AMDGPU_TARGETS is empty (whether from environment or any future CI wiring mistake) the build fails immediately with a clear message rather than silently producing a binary compiled for an unknown GPU target. Assisted-by: Claude Code:claude-sonnet-4-6 Signed-off-by: Russell Sim <rsl@simopolis.xyz> * fix(build): plumb AMDGPU_TARGETS through to Docker builds The docker-build-backend Makefile macro and Dockerfile.golang did not pass AMDGPU_TARGETS to the inner make invocation, so hipblas builds always used the backend Makefile's hardcoded default GPU targets regardless of what was specified via environment or CI inputs. Signed-off-by: Russell Sim <rsl@simopolis.xyz> --------- Signed-off-by: Russell Sim <rsl@simopolis.xyz>	2026-05-02 15:53:14 +02:00
Ettore Di Giacinto	b1a99436c7	feat(branding): admin-configurable instance name, tagline, and assets (#9635 ) Adds a whitelabeling feature so an operator can replace the LocalAI instance name, tagline, square logo, horizontal logo, and favicon from the admin Settings page. Defaults fall back to the bundled assets so existing installs are unaffected. The public GET /api/branding endpoint is reachable pre-auth so the login screen can render the configured branding before sign-in. Mutating routes (POST/DELETE /api/branding/asset/:kind) remain admin-only. Text fields (instance_name, instance_tagline) ride the existing /api/settings flow; binary assets get a dedicated multipart upload route that persists files under DynamicConfigsDir/branding/. To prevent the Settings page's stale local state from clobbering an upload on save, UpdateSettingsEndpoint preserves whatever the on-disk asset filename fields are regardless of the body — /api/branding/asset/* are the sole writers for those fields. The MCP catalog gains get_branding and set_branding tools (text fields only; file upload stays UI-only) plus a configure_branding skill prompt. While wiring this up, the same restart-loss class of bug surfaced for several existing fields whose RuntimeSettings entries were never read by the startup loader. Fix loadRuntimeSettingsFromFile() to load: - branding (instance_name, instance_tagline, *_file basenames) - auto_upgrade_backends, prefer_development_backends - localai_assistant_enabled - open_responses_store_ttl - the 7 existing AgentPool fields (enabled, default/embedding model, chunking sizes, enable_logs, collection_db_path) Also exposes 3 new AgentPool runtime settings (vector_engine, database_url, agent_hub_url) via /api/settings + the Settings UI, with the same load-on-startup wiring. The file watcher's manual-edit path is intentionally not changed — the in-process API endpoints already update appConfig directly, so the watcher is redundant for supported flows and a separate refactor for everything else. 15 TDD specs cover the loader behaviour (1 branding + 11 adjacent + 3 new agent-pool); 2 specs cover the persistence helpers and the clobber-prevention contract. Assisted-by: claude-code:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-02 15:51:36 +02:00
Ettore Di Giacinto	7325046650	fix(diffusers): drop compel from requirements to unblock pip resolver (#9632 ) compel 2.3.1 (latest, Nov 2025) declares transformers~=4.25 in its metadata, i.e. >=4.25,<5.0. After transformers 5.0 (2026-01-26) and huggingface-hub 1.0 (2025-10-27) shipped, the weekly DEPS_REFRESH cache rotation in CI started seeing the new majors and pip's resolver went into multi-hour backtracking storms walking every transformers 4.x candidate against every accelerate/hf-hub/tokenizers combination to find a set compel would accept. The 2026-04-29 backend-build for the diffusers backend (darwin-mps + l4t + cublas13-turboquant matrix cells) hit the GitHub Actions 6h job timeout still inside pip install — the build itself never started. compel is the only hard upper bound on transformers in this stack (diffusers, accelerate, peft, optimum-quanto are all flexible), and upstream support for transformers 5 is still in flight: damian0815/ compel#129 ("Modernize Compel for Transformers 5") and #128 ("Bump transformers version to >5.0") are both open as of today. backend.py only constructs Compel() when COMPEL=1 is set in the env (default off), so make compel a true optional extra: - Wrap the top-level `from compel import ...` in try/except ImportError, mirroring the existing sd_embed pattern. - Auto-disable COMPEL with a warning when the module isn't installed, instead of crashing on module load. - Drop compel from all eight requirements-*.txt variants so the resolver no longer has to satisfy its transformers cap. - Leave a TODO in backend.py and in each requirements file pointing at the upstream PR/issue, so the dependency can be reinstated once compel supports transformers >= 5. Users who rely on weighted-prompt embeddings can opt in with a manual `pip install compel` alongside COMPEL=1; the warning emitted on startup tells them how. Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit WebFetch] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-01 14:45:14 +02:00
Ettore Di Giacinto	8452068f43	feat(importers): whisper.cpp HF repos pick a quant + nest under whisper/models (#9630 ) The WhisperImporter's Import() switch ordered LooksLikeURL ahead of the HuggingFace branch, so any https://huggingface.co/<owner>/<repo> URI (e.g. LocalAI-io/whisper-large-v3-it-yodas-only-ggml) hijacked the URL path. FilenameFromUrl returned the repo slug, the gallery entry pointed at the HTML repo page, the SHA256 was empty, and the HF file listing was effectively dead code for HTTPS imports. The HF branch only fired for huggingface://owner/repo and hf://owner/repo references. Gate the URL case on a "ggml-.bin" basename signal — mirroring how the llama-cpp importer gates on ".gguf" — so direct file URLs still take the URL path while HF repo URLs fall through to the HF branch. There the file listing is actually consulted: every ggml-.bin entry is collected and one is picked by the new preferences.quantizations preference (default q5_0; comma-separated for fallback ordering). Pin the chosen file under whisper/models/<name>/<file> so a single repo can ship q4_0/q5_0/q8_0 side-by-side without colliding on disk, matching the llama-cpp/models/<name>/ layout. The fallback when no preference matches is the last available ggml file, mirroring llama-cpp's pickPreferredGroup behaviour. Tests: replace the previous probe spec with positive assertions against LocalAI-io/whisper-large-v3-it-yodas-only-ggml (default → ggml-model-q5_0.bin, quantizations=q4_0 → ggml-model-q4_0.bin) plus two offline specs that build a fake hfapi.ModelDetails to cover the fallback rule and non-ggml filtering without touching the network. Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit WebFetch] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-05-01 12:03:07 +02:00
ER-EPR	0b0078047f	Add tags to qwen3-vl-reranker and Qwen3-VL-Embedding to the gallery (#9628 ) * Add tags to Qwen3-VL-Reranker models Added tags for reranker models in index.yaml. Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com> * Add Qwen3-VL-Embedding models to gallery Added Qwen3-VL-Embedding-8B and Qwen3-VL-Embedding-2B models with detailed descriptions and file references. Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com> * Update index.yaml Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com> --------- Signed-off-by: ER-EPR <38782737+ER-EPR@users.noreply.github.com>	2026-05-01 10:56:58 +02:00
Tai An	80961d2da6	feat(backends/python): use tempfile.gettempdir() instead of hardcoded /tmp (#9629 ) Closes #9601 Makes the temporary scratch paths in vllm, vllm-omni, tinygrad, and pocket-tts backends configurable via the standard TMPDIR env var, instead of always writing to /tmp. This is a one-line change per call site that calls tempfile.gettempdir() for the directory and keeps the same filename suffix. Users who run on systems with a small root partition (or want to relocate scratch files to a larger volume) can now redirect these by setting TMPDIR (e.g. TMPDIR=/data/tmp), without affecting the existing LOCALAI_GENERATED_CONTENT_PATH or LOCALAI_UPLOAD_PATH options that already cover other temp paths. Files touched: - backend/python/vllm/backend.py (1 site: video base64 scratch) - backend/python/tinygrad/backend.py (1 site: image fallback dst) - backend/python/pocket-tts/backend.py (1 site: tts wav fallback dst) - backend/python/vllm-omni/backend.py (2 sites: video + audio scratch)	2026-05-01 10:56:24 +02:00
LocalAI [bot]	9c4c3f9d8f	chore: ⬆️ Update ggml-org/llama.cpp to `beb42fffa45eded44804a1fd4916146222371581` (#9624 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-01 02:02:56 +02:00
LocalAI [bot]	273416f54b	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `a8aecbf15933295af96504f9a693998322185b5c` (#9625 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-01 02:02:29 +02:00
Ettore Di Giacinto	c02a50f2ab	feat(llama-cpp): bump to d775992 and adapt to spec params refactor (#9618 ) Bumps backend/cpp/llama-cpp/Makefile LLAMA_VERSION from 665abc6 to d775992, picking up upstream PR ggml-org/llama.cpp#22397 which splits common_params_speculative into nested draft / ngram_simple / ngram_mod sub-structs. Renames every grpc-server.cpp reference to match: speculative.mparams_dft.path -> speculative.draft.mparams.path speculative.{n_max,n_min} -> speculative.draft.{n_max,n_min} speculative.{p_min,p_split} -> speculative.draft.{p_min,p_split} speculative.{n_gpu_layers,n_ctx} -> speculative.draft.{n_gpu_layers,n_ctx} speculative.ngram_size_n -> speculative.ngram_simple.size_n speculative.ngram_size_m -> speculative.ngram_simple.size_m speculative.ngram_min_hits -> speculative.ngram_simple.min_hits The "speculative.n_max" JSON key sent to the upstream server stays unchanged — server-task.cpp still reads it and routes the value into draft.n_max internally. The turboquant fork (TheTom/llama-cpp-turboquant @ 11a241d) branched before #22397 and still exposes the flat layout. Since turboquant reuses the shared backend/cpp/llama-cpp/grpc-server.cpp, extend patch-grpc-server.sh with an idempotent sed block that reverts the ten field references back to the legacy flat names on the build copy only — the original under backend/cpp/llama-cpp/ stays compiling against vanilla upstream. Drop the block once the fork rebases. ik-llama-cpp has its own grpc-server.cpp with no speculative refs (0/2661 lines), so it is unaffected. Validated locally with `make docker-build-llama-cpp` (avx, avx2, avx512, fallback, grpc + rpc-server all built; image exported). Assisted-by: Claude:claude-opus-4-7 [Bash Read Edit] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-30 08:44:43 +02:00
LocalAI [bot]	76971fb2aa	chore: ⬆️ Update leejet/stable-diffusion.cpp to `3d6064b37ef4607917f8acf2ca8c8906d5087413` (#9617 ) ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-30 08:43:42 +02:00
LocalAI [bot]	ebd9fcbe20	chore(model gallery): 🤖 add 1 new models via gallery agent (#9615 ) chore(model gallery): 🤖 add new models via gallery agent Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-29 22:33:41 +02:00
Ettore Di Giacinto	091eda8d70	feat: react chat redesign (#9616 ) * feat(react-ui): redesign chat — popover history, focus on send, density pass Replace the persistent 260px conversation sidebar with a Cmd/Ctrl+K popover (ChatsMenu) so the conversation owns the page. Once a chat has at least one message we auto-collapse the global app rail and fade non-essential header chrome; Esc gives the user back the full chrome for the rest of the session. Move Canvas mode and the MCP dropdown into the input wrapper as mode chips — they describe what's armed for the next message and now live where the user composes. The chat header drops to Chats · title · ModelSelector · overflow · settings, and an overflow menu carries admin-only Manage mode along with Info / Edit / Export / Clear. Density pass: tighter header (40px), smaller avatars with the assistant left-border accent doing the work, 88% bubble width, modern field-sizing on the textarea, 32px send/stop buttons. Empty state now surfaces a Recent strip (top 4 non-empty chats) and a Cmd+K hint, replacing the discoverability the persistent sidebar used to provide. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 * feat(react-ui): chat input chips, slimmer menu, focus mode polish Move Canvas mode and the MCP dropdown into the input wrapper as compact mode chips — they describe what's armed for the next message and now sit where the user composes. The MCP popover flips upward when anchored to the input row so it stays on-screen. Eliminate the chat header overflow ("…") menu entirely; relocate each item to its semantic home so users don't have to remember a miscellany drawer: - Manage mode toggle → top of the Settings drawer, alongside the other sticky chat knobs. The shield next to the title still signals state at a glance. - Model info / Edit config → small admin-only "ⓘ" button next to the ModelSelector; the existing model-info panel now hosts the Edit config link. - Export as Markdown → per-row hover action in ChatsMenu, so it works for any chat (not just the active one). - Clear chat history → destructive button at the bottom of the Settings drawer. Make the Sidebar listen to its own `sidebar-collapse` event so the chat's focus mode actually shrinks the rail (it previously only flipped the layout class, leaving the sidebar element at full width and overlapping the chat). Drop the focus-mode toast — the visual shift is enough; the toast was noise. Define `--color-text-tertiary` in both themes; without it metadata text (recent strip timestamps and a few other sites) was inheriting the platform default, which read as black on the dark surface. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 * fix(model/log-store): close merged channel exactly once; clean up Remove Two latent races in BackendLogStore.Subscribe could panic under load (distributed e2e test triggered "send on closed channel" at backend_log_store.go:288): 1. The aggregated path closed the merged channel `ch` from two places — the fan-in waiter goroutine (after all source channels drained) and unsubscribe(). When unsubscribe ran while a fan-in goroutine was mid-flight on `ch <- line`, the close beat the send and the runtime panicked. Now `ch` is closed by exactly one goroutine: the waiter that observes all fan-in goroutines finish. unsubscribe() only closes the per-buffer source channels — the for-range in each fan-in goroutine then exits naturally and the waiter takes care of the merged close. 2. Remove() closed every subscriber channel but didn't delete the entries from the subscribers map, so a concurrent unsubscribe() would call close() again on the already-closed channel ("close of closed channel"). Clear the map entry while closing. Add a regression test that hammers AppendLine concurrently with Subscribe + unsubscribe + Remove; the race detector catches both classes of regression. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 * test(model/log-store): port backend log store tests to ginkgo Bring backend_log_store_test.go in line with the rest of pkg/model (loader_test, watchdog_test, store_test): same external test package (`model_test`), same ginkgo + gomega imports, same Describe/It nesting around the public API. Behaviour is unchanged — the four existing scenarios plus the unsubscribe race regression all run as specs under the existing `TestModel` suite. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-29 22:33:26 +02:00
Ettore Di Giacinto	fe6eb57082	feat(vibevoice-cpp): add purego TTS+ASR backend (#9610 ) * feat(vibevoice-cpp): add purego TTS+ASR backend Wire up Microsoft VibeVoice via the vibevoice.cpp C ABI as a new purego-based Go backend that serves both Backend.TTS and Backend.AudioTranscription from a single gRPC binary. Mirrors the qwen3-tts-cpp / sherpa-onnx pattern so the variant matrix (cpu/cuda12/cuda13/metal/rocm/sycl-f16/f32/vulkan/l4t) and the e2e-backends gRPC harness reuse existing infrastructure. - backend/go/vibevoice-cpp/ - Makefile, CMakeLists, purego shim, gRPC Backend with model-dir auto-detection, closed-loop TTS->ASR smoke test - backend/index.yaml - &vibevoicecpp meta + 18 image entries - Makefile - .NOTPARALLEL, BACKEND_VIBEVOICE_CPP, docker-build wiring, test-extra-backend-vibevoice-cpp-{tts,transcription} e2e wrappers - .github/workflows/backend.yml - matrix entries for all variants - .github/workflows/test-extra.yml - per-backend smoke + 2 gRPC e2e jobs * feat(vibevoice-cpp): drop hardcoded glob detection, add gallery entries Refactor backend Load() to follow the standard Options[] convention used by sherpa-onnx and the rest of the multi-role backends: ModelFile is the primary gguf, supplementary paths come through opts.Options[] as key=value (or key:value for Make-target compat), resolved against opts.ModelPath. type=asr/tts decides the role of ModelFile when neither tts_model nor asr_model is set explicitly. Add gallery/index.yaml entries: - vibevoice-cpp - realtime 0.5B Q8_0 TTS + tokenizer + Carter voice - vibevoice-cpp-asr - long-form ASR Q8_0 + tokenizer Both pull from huggingface://mudler/vibevoice.cpp-models with sha256 verification. parameters.model + Options[] paths are siblings under {models_dir} per the qwen3-tts-cpp convention. Update Makefile e2e wrappers to pass BACKEND_TEST_OPTIONS comma+colon style, and tighten the per-backend Go closed-loop test to use the explicit Options API. * fix(vibevoice-cpp): force whole-archive link so vv_capi_* exports survive libvibevoice is a STATIC archive linked into the MODULE library. Without --whole-archive (or -force_load on Apple, /WHOLEARCHIVE on MSVC), the linker garbage-collects symbols not referenced from this translation unit - which means dlopen+RegisterLibFunc panics with 'undefined symbol: vv_capi_load' at backend startup, since purego looks them up by name and our cpp/govibevoicecpp.cpp doesn't call them directly. * test(vibevoice-cpp): rewrite suite with Ginkgo v2 Match the convention used by backend/go/sherpa-onnx/backend_test.go. The suite now covers backend semantics that don't need purego (Locking, empty-ModelFile rejection, TTS/ASR-without-loaded-model errors) on top of the gRPC lifecycle specs (Health, Load, closed-loop TTS->ASR). Model-dependent specs Skip() when VIBEVOICE_MODEL_DIR is unset, so `go test ./backend/go/vibevoice-cpp/` is green on a clean checkout and runs the heavyweight closed-loop spec when test.sh has staged the bundle. * fix(vibevoice-cpp): implement TTSStream + AudioTranscriptionStream The gRPC server's stream handlers (pkg/grpc/server.go) spawn a goroutine that ranges over a chan; the only thing closing that chan is the backend's own Stream method. With the default Base stub returning 'unimplemented' and never touching the chan, the server goroutine hangs forever and the client hits DeadlineExceeded - which is exactly what the e2e harness saw in the test-extra-backend-vibevoice-cpp-tts matrix run. TTSStream synthesizes via vv_capi_tts to a tempfile, then emits a streaming WAV header (chunk sizes 0xFFFFFFFF so HTTP clients can start playback before the full PCM lands) followed by the PCM body in 64 KB slices. The header + >=2 PCM frames satisfy the harness's 'expected >=2 chunks' assertion and give a real progressive stream. AudioTranscriptionStream runs the offline transcription, emits each segment as a delta, and closes with a final_result whose Text equals the concatenated deltas (the harness asserts those match). Two new Ginkgo specs guard the close-channel-on-error path so the deadline-exceeded regression can't come back silently. fix(vibevoice-cpp): silence errcheck on cleanup paths Lint flagged six unchecked Close()/Remove()/RemoveAll() calls along purely-cleanup deferred paths. Wrap each in '_ = ...' (or a closure for defers that take args) - matches what the rest of the LocalAI backend/go/* tree already does for these callsites. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(vibevoice-cpp): closed-loop slot fill + modelRoot-relative path resolution Two bugs the test-extra-backend-vibevoice-cpp-* CI matrix surfaced: 1. Closed-loop Load with ModelFile=tts.gguf + Options[asr_model=...] left v.ttsModel empty, because the default-fill block only ran when BOTH slots were empty. vv_capi_load then got tts="" + a voice and the C side rejected it with rc=-3 'TTS model required to load a voice'. Fix: ModelFile fills the primary role-slot (decided by 'type=' in Options, defaulting to tts) independently of the secondary, so ModelFile + asr_model resolves to both. 2. resolvePath stat'd CWD before falling back to relTo. With LocalAI launched from a directory that happens to contain a same-named file, supplementary Options[] paths could leak away from the models dir. Drop the CWD probe entirely - relative paths now always join onto opts.ModelPath (the gallery convention). New Ginkgo coverage: * 'ModelFile slot resolution' (4 specs) - asr_model+ModelFile, type=asr, explicit tts_model override, key:value variant. * 'resolvePath (relative-to-modelRoot)' (5 specs) - join, abs passthrough, empty input, empty relTo, and the CWD-trap regression test. * 'Load resolves relative Options paths against opts.ModelPath' - end- to-end gallery layout round-trip. Verified locally: 19/19 specs pass (with model bundle, including the closed-loop TTS->ASR; without bundle, 17 pass + 2 model-dependent skip). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(vibevoice-cpp): use gallery convention in closed-loop spec The 'loads the realtime TTS model' / closed-loop specs were passing already-prefixed paths into Options[]: Options: ['tokenizer=' + filepath.Join(modelDir, 'tokenizer.gguf')] Combined with no ModelPath set on the request, the backend's modelRoot fell back to filepath.Dir(ModelFile) = modelDir, then resolvePath joined the prefixed Options path on top of it - producing 'vibevoice-models/vibevoice-models/tokenizer.gguf' when the CI's VIBEVOICE_MODEL_DIR is the relative './vibevoice-models'. The fix is to mirror the gallery contract LocalAI core actually sends in production: ModelPath is the models root (absolute), ModelFile is a name under it, every Options[] path is relative to ModelPath. Uses filepath.Base() to get bare filenames. Verified locally with both VIBEVOICE_MODEL_DIR=/tmp/vv-bundle (abs) and VIBEVOICE_MODEL_DIR=vibevoice-models (the relative shape that broke CI). Both: 19/19 specs pass, ~55-60s. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(vibevoice-cpp): switch ASR to Q4_K + bump transcription timeout The Q8_0 ASR gguf is ~14 GB - too big to fit alongside the runner image, the docker build cache, and the test artifacts on a free ubuntu-latest GHA runner; 'test-extra-backend-vibevoice-cpp-transcription' was getting SIGTERM'd at 90 min before the model could finish loading. Switch to Q4_K (~10 GB on disk, slightly faster CPU decode) for: * the e2e harness Make target * the gallery 'vibevoice-cpp-asr' entry (parameters + files block) * the per-backend test.sh auto-download list Bump tests-vibevoice-cpp-grpc-transcription's timeout-minutes from 90 to 150 - even with Q4_K, the 30 s JFK clip on a CPU runner needs runway above the previous 90 min cap. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(vibevoice-cpp): drop transcription gRPC e2e job - too heavy for free runners The vibevoice ASR is a 7B-parameter model. Even on Q4_K (~10 GB on disk) a single 30 s transcription saturates the per-test 30 min timeout in the e2e-backends harness on a 4-core ubuntu-latest, and the 10 GB download + Docker layer + working space leaves no headroom on the runner's free disk. Two attempts in CI got SIGTERM'd at the LoadModel boundary - the bottleneck isn't tunable from the workflow side without a paid-tier runner. The per-backend tests-vibevoice-cpp job already runs the same AudioTranscription path via a closed-loop TTS->ASR Ginkgo spec - same gRPC contract, same model, single process - so the standalone tests-vibevoice-cpp-grpc-transcription job was redundant on top of the disk/CPU pressure. The Makefile target test-extra-backend-vibevoice-cpp-transcription stays for local invocation on workstations that can afford it - useful when developing the streaming codepaths. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(vibevoice-cpp): restore transcription gRPC e2e on bigger-runner Switch tests-vibevoice-cpp-grpc-transcription from ubuntu-latest to the self-hosted 'bigger-runner' label that GPU image builds in backend.yml use, plus the documented Free-disk-space prep step (purge dotnet / ghc / android / CodeQL caches) the disabled vllm/sglang entries in this file describe. That gives the 7B-param Q4_K ASR model the disk + CPU runway it needs. Keep timeout-minutes: 150 - even on a beefier runner the 30 s JFK decode plus 10 GB download has to fit comfortably. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * ci(vibevoice-cpp): apt-get install make on bigger-runner before transcription e2e bigger-runner is a self-hosted bare runner without the standard ubuntu image's preinstalled build tools, so the previous job died at the very first command with 'make: command not found' (exit 127). Add the Dependencies step that the disabled vllm/sglang entries in this file already document - apt-get installs make + build-essential + curl + unzip + ca-certificates + git + tar before the make target runs. Mirrors how every other 'runs-on: bigger-runner' entry in backend.yml prepares the runner. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-29 22:22:14 +02:00
LocalAI [bot]	13fe37df89	chore(model gallery): 🤖 add 1 new models via gallery agent (#9611 ) chore(model gallery): 🤖 add new models via gallery agent Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-29 09:06:22 +02:00
Richard Palethorpe	4916f8c880	feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map (#9563 ) * feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map LocalAI's vLLM backend wraps a small typed subset of vLLM's AsyncEngineArgs (quantization, tensor_parallel_size, dtype, etc.). Anything outside that subset -- pipeline/data/expert parallelism, speculative_config, kv_transfer_config, all2all_backend, prefix caching, chunked prefill, etc. -- requires a new protobuf field, a Go struct field, an options.go line, and a backend.py mapping per feature. That cadence is the bottleneck on shipping vLLM's production feature set. Add a generic `engine_args:` map on the model YAML that is JSON-serialised into a new ModelOptions.EngineArgs proto field and applied verbatim to AsyncEngineArgs at LoadModel time. Validation is done by the Python backend via dataclasses.fields(); unknown keys fail with the closest valid name as a hint. dataclasses.replace() is used so vLLM's __post_init__ re-runs and auto-converts dict values into nested config dataclasses (CompilationConfig, AttentionConfig, ...). speculative_config and kv_transfer_config flow through as dicts; vLLM converts them at engine init. Operators can now write: engine_args: data_parallel_size: 8 enable_expert_parallel: true all2all_backend: deepep_low_latency speculative_config: method: deepseek_mtp num_speculative_tokens: 3 kv_cache_dtype: fp8 without further proto/Go/Python plumbing per field. Production defaults seeded by hooks_vllm.go: enable_prefix_caching and enable_chunked_prefill default to true unless explicitly set. Existing typed YAML fields (gpu_memory_utilization, tensor_parallel_size, etc.) remain for back-compat; engine_args overrides them when both are set. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * chore(vllm): pin cublas13 to vLLM 0.20.0 cu130 wheel vLLM's PyPI wheel is built against CUDA 12 (libcudart.so.12) and won't load on a cu130 host. Switch the cublas13 build to vLLM's per-tag cu130 simple-index (https://wheels.vllm.ai/0.20.0/cu130/) and pin vllm==0.20.0. The cu130-flavoured wheel ships libcudart.so.13 and includes the DFlash speculative-decoding method that landed in 0.20.0. cublas13 install gets --index-strategy=unsafe-best-match so uv consults both the cu130 index and PyPI when resolving — PyPI also publishes vllm==0.20.0, but with cu12 binaries that error at import time. Verified: Qwen3.5-4B + z-lab/Qwen3.5-4B-DFlash loads and serves chat completions on RTX 5070 Ti (sm_120, cu130). Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * ci(vllm): bot job to bump cublas13 vLLM wheel pin vLLM's cu130 wheel index URL is itself version-locked (wheels.vllm.ai/<TAG>/cu130/, no /latest/ alias upstream), so a vLLM bump means rewriting two values atomically — the URL segment and the version constraint. bump_deps.sh handles git-sha-in-Makefile only; add a sibling bump_vllm_wheel.sh and a matching workflow job that mirrors the existing matrix's PR-creation pattern. The bumper queries /releases/latest (which excludes prereleases), strips the leading 'v', and seds both lines unconditionally. When the file is already on the latest tag the rewrite is a no-op and peter-evans/create-pull-request opens no PR. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> * docs(vllm): document engine_args and speculative decoding The new engine_args: map plumbs arbitrary AsyncEngineArgs through to vLLM, but the public docs only covered the basic typed fields. Add a short subsection in the vLLM section explaining the typed/generic split and showing a worked DFlash speculative-decoding config, with pointers to vLLM's SpeculativeConfig reference and z-lab's drafter collection. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com> Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2026-04-29 00:49:28 +02:00
LocalAI [bot]	55afda22e3	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `453a027c17e4d63a7f16b871197a396240a65138` (#9608 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-29 00:18:19 +02:00
LocalAI [bot]	1fe3558ec6	feat(swagger): update swagger (#9607 ) Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-29 00:18:02 +02:00
Ettore Di Giacinto	e370318bd7	fix(vllm): seed pybind11 for fastsafetensors build under --no-build-isolation fastsafetensors==0.3 (transitive dep of vllm) imports pybind11 in setup.py without declaring it in build-system.requires. With --no-build-isolation it has to already exist in the venv, otherwise the wheel build fails with ModuleNotFoundError on arm64 L4T CUDA 13 (and any other profile that picks up vllm 0.20.0). Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-28 20:08:26 +00:00
Richard Palethorpe	4443250756	chore: add golangci-lint with new-from-merge-base baseline (#9603 ) * chore: add golangci-lint with new-from-merge-base baseline Configure golangci-lint v2 with the standard linter set (errcheck, govet, ineffassign, unused) plus forbidigo, which enforces the Ginkgo/Gomega-only test convention from .agents/coding-style.md by rejecting stdlib testing calls (t.Errorf, t.Fatalf, t.Run, ...). staticcheck is disabled — the codebase has many pre-existing QF-style suggestions not worth gating on. issues.new-from-merge-base = master makes the lint job a gate for new issues only; the ~1300 pre-existing baseline stays visible via 'make lint-all' for incremental cleanup. CI runs 'make lint'. Backends needing C/C++ headers we don't install in the lint runner are excluded via a deny list in the Makefile (backend/go/{piper,silero-vad, llm}, cmd/launcher). Discovery still flows through 'go list ./...', so new packages are scanned automatically. To make backend/go/{sam3-cpp,stablediffusion-ggml,whisper} typecheckable, move their .cpp/.h sources into cpp/ subdirs (matching qwen3-tts-cpp / acestep-cpp). Without this 'go list' rejects the package because Go does not allow .cpp alongside .go without cgo. Fix two real bugs found by lint in tests/integration/ (run only via 'make test-stores', not default CI): a stale zerolog reference left over from the slog migration (`c37785b7`) and an unused 'os' import. Assisted-by: Claude Code:Opus 4.7 (1M) [Bash] [Read] [Edit] [Write] Signed-off-by: Richard Palethorpe <io@richiejp.com> * ci(lint): generate proto sources and fetch full history The lint job was failing for two reasons: - pkg/grpc/proto/.go is generated, not checked in. Several packages import it, so without 'make protogen-go' typecheck fails project-wide with "no required module provides package github.com/mudler/LocalAI/ pkg/grpc/proto". - golangci-lint's new-from-merge-base needs to git-merge-base the PR against master, but actions/checkout's default shallow clone doesn't fetch master. fetch-depth: 0 brings full history; the config now references origin/master (the remote-tracking branch that survives the shallow checkout) instead of bare master (which doesn't exist locally after checkout). Assisted-by: Claude Code:Opus 4.7 (1M) [Bash] [Read] [Edit] [Write] Signed-off-by: Richard Palethorpe <io@richiejp.com> ci(lint): stub react-ui/dist for go:embed glob core/http/app.go has //go:embed react-ui/dist/*. The glob must match at least one non-hidden entry or typecheck fails the whole core/http package. We don't need the real React bundle to lint Go code, so just touch an empty index.html to satisfy the embed. Assisted-by: Claude Code:Opus 4.7 (1M) [Bash] [Read] [Edit] [Write] Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-04-28 22:07:44 +02:00
Ettore Di Giacinto	bcef72b9c1	feat: localai assistant chat modality (#9602 ) * fix(tests): inline model_test fixtures after tests/models_fixtures removal The previous reorg removed tests/models_fixtures/ but core/config/model_test.go still read CONFIG_FILE/MODELS_PATH env vars pointing into that directory, so `make test` failed with "open : no such file or directory" on the readConfigFile spec (the suite ran with --fail-fast and bailed before openresponses_test). Inline the YAMLs (config/embeddings/grpc/rwkv/whisper) directly into the test file, materialise them into a per-test tmpdir via BeforeEach, and drop the env-var lookups. The test no longer depends on Makefile plumbing. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:claude-opus-4-7 [Edit] [Write] [Bash] * refactor(modeladmin): extract model-admin helpers into a service package Lift the bodies of EditModelEndpoint, PatchConfigEndpoint, ToggleStateModelEndpoint, TogglePinnedModelEndpoint and VRAMEstimateEndpoint into core/services/modeladmin so the same logic can be called by non-HTTP clients (notably the in-process MCP server that backs the LocalAI Assistant chat modality, landing in a follow-up commit). The HTTP handlers shrink to thin shells that parse echo inputs, call the matching helper, map typed errors (ErrNotFound, ErrConflict, ErrPathNotTrusted, ErrBadAction, ...) to the existing HTTP status codes, and render the existing response shapes. No REST-surface behaviour change; the existing localai endpoint tests cover the regression net. Adds focused unit tests for each helper against tmp-dir-backed ModelConfigLoader fixtures (deep-merge patch, rename + conflict, path separator guard, toggle/pin enable/disable, sync callback). Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat(assistant): LocalAI Assistant chat modality with in-memory MCP server Adds a chat modality, admin-only, that wires the chat session to an in-memory MCP server exposing LocalAI's own admin/management surface as tools. An admin can install models, manage backends, edit configs and check status by chatting; the LLM calls tools like gallery_search, install_model, import_model_uri, list_installed_models, edit_model_config and surfaces the results. Same Go package powers two modes: pkg/mcp/localaitools/ NewServer(client, opts) builds an MCP server that registers the 19-tool admin catalog. The LocalAIClient interface has two impls: - inproc.Client — calls services directly (no HTTP loopback, no synthetic admin API key). Used in-process by the chat handler. - httpapi.Client — calls the LocalAI REST API. Used by the new `local-ai mcp-server --target=…` subcommand to control a remote LocalAI from a stdio MCP host. Tools and their embedded skill prompts are agnostic to which client backs them. Skill prompts are markdown files under prompts/, embedded via go:embed and assembled into the system prompt at server init. Wiring: - core/http/endpoints/mcp/localai_assistant.go — process-wide holder that spins up the in-memory MCP server once at Application start using paired net.Pipe transports, then reuses LocalToolExecutor (no fork) for every chat request that opts in. - core/http/endpoints/openai/chat.go — small branch ahead of the existing MCP block: when metadata.localai_assistant=true, defense-in-depth admin check + executor swap + system-prompt injection. All downstream tool dispatch is unchanged. - core/http/auth/{permissions,features}.go — adds FeatureLocalAIAssistant; gating happens at the chat handler entry plus admin-only `/api/settings`. - core/cli/{run.go,cli.go,mcp_server.go} — LOCALAI_DISABLE_ASSISTANT flag (runtime-toggleable via Settings, no restart), plus `local-ai mcp-server` stdio subcommand. - core/config/runtime_settings.go — `localai_assistant_enabled` runtime setting; the chat handler reads `DisableLocalAIAssistant` live at request entry. UI: - Home.jsx — prominent self-explanatory CTA card on first run ("Manage LocalAI by chatting"); collapses to a compact "Manage by chat" button in the quick-links row once used, persisted via localStorage. - Chat.jsx — admin-only "Manage" toggle in the chat header, "Manage mode" badge, dedicated empty-state copy, starter chips. - Settings.jsx — "LocalAI Assistant" section with the runtime enable toggle. - useChat.js — `localaiAssistant` flag on the chat schema; injects `metadata.localai_assistant=true` on requests when active. Distributed mode: the in-memory MCP server lives only on the head node; inproc.Client wraps already-distributed-aware services so installs propagate to workers via the existing GalleryService machinery. Documentation: `.agents/localai-assistant-mcp.md` is the contributor contract — when adding an admin REST endpoint, also add a LocalAIClient method, an inproc + httpapi impl, a tool registration, and a skill prompt update; the AGENTS.md index links to it. Out of scope (follow-ups): per-tool RBAC granularity for non-admin read-only access; streaming mcp_tool_progress for long installs; React Vitest rig for the UI changes. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(assistant): extract tool/capability/MiB/server-name constants The MCP tool surface, capability tag set, server-name default, and the chat-handler metadata key were repeated as bare string literals across seven files. Renaming any one required hand-editing every call site and risked code/test/prompt drift. This pulls them into typed constants: - pkg/mcp/localaitools/tools.go — Tool* constants for the 19 MCP tools, plus DefaultServerName. - pkg/mcp/localaitools/capability.go — typed Capability + constants for the capability tag set the LLM passes to list_installed_models. The type rides through LocalAIClient.ListInstalledModels and replaces the triplet of "embed"/"embedding"/"embeddings" with the single CapabilityEmbeddings. - pkg/mcp/localaitools/inproc/client.go — bytesPerMiB constant for the VRAMEstimate byte→MB conversion. - core/http/endpoints/mcp/tools.go — MetadataKeyLocalAIAssistant for the "localai_assistant" request-metadata key consumed by the chat handler. Tool registrations, the test catalog, the dispatch table, the validation fixtures, and the fake/stub clients all reference the constants. The embedded skill prompts under prompts/ keep their bare strings (go:embed markdown can't import Go constants); the existing TestPromptsContain SafetyAnchors guards the alignment. No behaviour change. All tests pass with -race. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(modeladmin): typed Action for ToggleState/TogglePinned The toggle/pin verbs were bare strings everywhere — handler signatures, service implementations, MCP tool args, the fake/stub clients, the inproc and httpapi LocalAIClient impls, plus 4 test files. A typo in any caller silently fell through to the runtime "must be 'enable' or 'disable'" check. Introduce core/services/modeladmin.Action (string alias) with ActionEnable, ActionDisable, ActionPin, ActionUnpin and a small Valid helper. The compiler now catches mismatches at every boundary; renames ripple through one source of truth. LocalAIClient.ToggleModelState/Pinned signatures change to take modeladmin.Action. The package is brand-new and unreleased so this is a free public-API tightening. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(assistant): respect ctx cancellation on gallery channel sends InstallModel, DeleteModel, ImportModelURI, InstallBackend and UpgradeBackend all pushed onto galleryop channels with bare sends. If the worker was paused or the buffer full, the chat-handler goroutine blocked forever — the LLM kept polling and the request leaked. Wrap the five sends in a sendModelOp/sendBackendOp helper that selects on ctx.Done() so a cancelled chat completion surfaces context.Canceled back to the LLM instead of hanging. Adds inproc/client_test.go with a pre-cancelled-ctx regression test on InstallModel; the helpers are shared so the same guarantee covers the other four call sites. Assisted-by: Claude:claude-opus-4-7 [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(assistant): graceful shutdown for in-memory holder and stdio CLI Two related leaks: - Application.start() built the LocalAIAssistantHolder but never wired Close() into the graceful-termination chain — the in-memory MCP transport pair stayed alive until process exit, and the goroutines behind net.Pipe() didn't drain. Hook into the existing signals.RegisterGracefulTerminationHandler chain (same pattern as core/http/endpoints/mcp/tools.go:770). - core/cli/mcp_server.go ran srv.Run with context.Background(); a Ctrl-C from the host (Claude Desktop, mcphost, npx inspector) or a SIGTERM from process supervision left the stdio loop reading from a closed pipe. Switch to signal.NotifyContext to surface the signal through ctx and let srv.Run drain. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix(assistant): typed HTTPError + propagate prompt walk error The httpapi client detected "no such job" by substring-matching on the error string ("404", "could not find") — brittle to status-code formatting changes and to LocalAI fixing /models/jobs/:uuid to return a proper 404. Replace with a typed HTTPError whose Is() method honours errors.Is(err, ErrHTTPNotFound). The 500-with-"could not find" branch stays as a transitional fallback documented in Is(). Same change covers ListNodes' 404 fallback for the /api/nodes endpoint. Adds httptest tests for both 404 and the legacy 500 path, plus a direct errors.Is exposure test so external callers (the standalone stdio CLI host) can match without re-string-parsing. Also tightens prompts.SystemPrompt: panic when fs.WalkDir on the embedded FS fails. The only realistic cause is a build-time //go:embed misconfiguration; serving an empty system prompt to the LLM is much worse than crashing init. TestSystemPromptIncludesAllEmbeddedFiles catches regressions in CI. Assisted-by: Claude:claude-opus-4-7 [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> fix(modeladmin): atomic writes for model config files The five sites that wrote model YAML used os.WriteFile, which opens with O_TRUNC\|O_WRONLY\|O_CREATE. A crash mid-write left the destination truncated and the model unloadable until manual repair. Pre-existing behaviour inherited from the original endpoint handlers — fix once now that there's a single helper. Adds writeFileAtomic: writes to a sibling temp file, chmods, syncs via Close(), then os.Rename. Same-directory temp keeps the rename atomic on the same filesystem; cleanup runs on every error path so stray temps don't accumulate. No new dependency. Applied to: - ConfigService.PatchConfig - ConfigService.EditYAML (both rename and in-place branches) - mutateYAMLBoolFlag (drives ToggleState + TogglePinned) atomic_test.go covers the happy path plus a read-only-dir failure case that asserts the original file is preserved (skipped on Windows where the chmod trick is POSIX-specific). Assisted-by: Claude:claude-opus-4-7 [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(assistant): prune dead code, mark stub, document conventions Three small cleanups landing together: - Drop the unused errNotImplemented sentinel from inproc/client.go. All five methods that used to return it are wired to modeladmin helpers since the Phase B commit; the package var is dead. - Annotate httpapi.Client.GetModelConfig as a known stub. LocalAI's /models/edit/:name returns rendered HTML, not JSON, so the standalone CLI's get_model_config tool surfaces a clear error to the LLM. A future JSON-only /api/models/config-yaml/:name endpoint is tracked in the agent contract; FIXME points at it. - Extend `.agents/localai-assistant-mcp.md` with a "Code conventions" section that documents the audit-driven rules: tool/Capability/Action constants, errors.Is over substring matching, ctx-aware channel sends, atomic writes, and graceful shutdown. Refresh the file map so it lists tools.go and capability.go and drops the removed tools_bootstrap.go. The tools_models.go diff is a comment-only change explaining why the ModelName empty-string check stays at the tool layer (consistency across LocalAIClient implementations, since the SDK schema validator only enforces presence, not non-empty). Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test(assistant): convert test files to ginkgo + gomega The repo convention (per core/http/endpoints/localai/_test.go, core/gallery/, etc.) is Ginkgo v2 with Gomega assertions. The tests I introduced for the assistant feature used vanilla testing.T, which made them stand out and stripped the BDD structure the rest of the suite relies on. Convert every test file in the assistant scope to Ginkgo: pkg/mcp/localaitools/ dto_test.go — Describe("DTOs round-trip through JSON") prompts_test.go — Describe("SystemPrompt assembler") server_test.go — Describe("Server tool catalog"), Describe("Tool dispatch"), Describe("Tool error surfacing"), Describe("Argument validation"), Describe("Concurrent tool calls") parity_test.go — Describe("LocalAIClient parity"), hosts the suite's single RunSpecs (the file is package localaitools_test so it can import httpapi without an import cycle; Ginkgo aggregates Describes from both the internal and external test packages into one run). httpapi/client_test.go — Describe("httpapi.Client against the LocalAI admin REST surface"), Describe("ErrHTTPNotFound"), Describe("Bearer token") inproc/client_test.go — Describe("inproc.Client cancellation") core/services/modeladmin/ config_test.go — Describe("ConfigService") with sub-Describes for GetConfig, PatchConfig, EditYAML state_test.go — Describe("ConfigService.ToggleState") pinned_test.go — Describe("ConfigService.TogglePinned") atomic_test.go — Describe("writeFileAtomic") core/http/endpoints/mcp/ localai_assistant_test.go — Describe("LocalAIAssistantHolder") Each package gets a `_suite_test.go` with the standard `RegisterFailHandler(Fail) + RunSpecs(t, "...")` boilerplate. Helpers that previously took testing.T (newTestService, writeModelYAML, readMap, sortedStrings, sortGalleries, etc.) drop the T receiver and use Gomega Expectations directly. tmp dirs come from GinkgoT().TempDir(). No semantic change to test coverage — every original assertion has a direct Gomega counterpart. All suites pass with -race. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * test+docs(assistant): drift detector for Tool ↔ REST route mapping Honest gap from the audit: the parity_test.go suite only checks four methods, and uses the same httpapi.Client for both sides — it asserts stability of the DTO shapes, not equivalence between in-process and HTTP. If a contributor adds an admin REST endpoint without an MCP tool, or a tool without a matching httpapi route, both surfaces silently diverge. Add a coverage test plus stronger docs: - pkg/mcp/localaitools/coverage_test.go introduces a hand-maintained toolToHTTPRoute map: every Tool* constant must list the REST endpoint the httpapi.Client hits (or "(none)" with a documented reason). Two Ginkgo specs assert the map and the published catalog stay in sync — one fails when a Tool is added without a route entry, the other fails when a route entry references a tool that no longer exists. Verified by removing the ToolDeleteModel entry locally; the test fired with a clear message pointing the contributor at the file. Deliberate non-test: we don't enumerate live admin REST routes from here. Walking the route registry requires booting Application; parsing core/http/routes/localai.go is brittle. The "new admin REST endpoint → MCP tool" direction stays a PR checklist item — see below. - AGENTS.md gets a new Quick Reference bullet that calls out the rule and points at the test by name. - .agents/api-endpoints-and-auth.md tightens the existing "Companion: MCP admin tool surface" subsection from "if useful, consider..." to "MUST be considered, with three concrete outcomes (tool added, deliberately skipped with documented reason, or forgot — which breaks the contract)". Adds a checklist item at the bottom of the file's authoritative checklist. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Write] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(assistant): drop duplicate DTOs, surface canonical types Audit feedback: localaitools/dto.go reinvented several types that already existed in the codebase. Replace the duplicates with the canonical types so the LLM-visible wire format stays aligned with the rest of LocalAI by construction (no parallel structs to keep in sync). Removed (and the canonical type now used by the LocalAIClient interface): localaitools.Gallery → config.Gallery localaitools.GalleryModelHit → gallery.Metadata localaitools.VRAMEstimate → vram.EstimateResult Tightened scope: localaitools.Backend → kept, but reduced to {Name, Installed}. ListKnownBackends now returns []schema.KnownBackend (the canonical type already used by REST /backends/known). Kept with documented rationale: localaitools.JobStatus — galleryop.OpStatus has Error error which marshals to "{}". JobStatus is the JSON-friendly mirror. localaitools.Node — nodes.BackendNode carries gorm internals + token hash; we expose only the LLM-relevant fields. ImportModelURIRequest/Response — schema.ImportModelRequest and GalleryResponse are wire-shaped, mine are LLM-shaped (BackendPreference flat, AmbiguousBackend exposed). Side wins: - Drop bytesPerMiB; vram.EstimateResult already carries human-readable display strings (size_display, vram_display) the LLM uses directly. - Drop the handler-private vramEstimateRequest in core/http/endpoints/localai/vram.go and bind directly into modeladmin.VRAMRequest (now JSON-tagged). Both clients pass through these types now where possible (e.g. ListGalleries in inproc.Client is a one-liner returning AppConfig.Galleries; httpapi.Client.GallerySearch decodes straight into []gallery.Metadata). All tests green with -race. Assisted-by: Claude:claude-opus-4-7 [Read] [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * refactor(assistant): extract REST route paths into named constants httpapi.Client had 18 bare-string path sites scattered across methods. Pull them into pkg/mcp/localaitools/httpapi/routes.go: static paths as package-private constants, dynamic paths as small builders that handle url.PathEscape on segment values. No behaviour change. Drops the now-unused net/url import from client.go since path escaping moved into routes.go alongside the path it applies to. Local-only by design: the server-side registrations in core/http/routes/localai.go remain bare strings. Sharing constants across the pkg/ ↔ core/ boundary would invert the layering today; the existing Tool↔REST drift-detector in coverage_test.go is the safety net for that direction. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] * docs(assistant): align with shipped UI and dropped bootstrap env vars The LocalAI Assistant doc still described the older iteration: - The in-chat toggle was renamed from "Admin" to "Manage" (the badge is now "Manage mode" and the home page exposes a "Manage by chat" CTA). - LOCALAI_ASSISTANT_BOOTSTRAP_MODEL / --localai-assistant-bootstrap-model and the bootstrap_default_model tool were removed — admins pick a model from the existing selector instead, no env-var configuration required. - The shipped tool catalog includes import_model_uri but didn't appear in the doc; bootstrap_default_model appeared but no longer exists. - The Settings → LocalAI Assistant runtime toggle wasn't mentioned as the preferred way to disable without restart. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-28 19:29:27 +02:00
Ettore Di Giacinto	142919fc79	fix(tests): inline model_test fixtures after tests/models_fixtures removal The previous reorg removed tests/models_fixtures/ but core/config/model_test.go still read CONFIG_FILE/MODELS_PATH env vars pointing into that directory, so `make test` failed with "open : no such file or directory" on the readConfigFile spec (the suite ran with --fail-fast and bailed before openresponses_test). Inline the YAMLs (config/embeddings/grpc/rwkv/whisper) directly into the test file, materialise them into a per-test tmpdir via BeforeEach, and drop the env-var lookups. The test no longer depends on Makefile plumbing. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:claude-opus-4-7 [Edit] [Write] [Bash]	2026-04-28 12:58:49 +00:00
dependabot[bot]	439471baec	chore(deps): bump packaging from 24.1 to 26.2 in /backend/python/coqui (#9594 ) Bumps [packaging](https://github.com/pypa/packaging) from 24.1 to 26.2. - [Release notes](https://github.com/pypa/packaging/releases) - [Changelog](https://github.com/pypa/packaging/blob/main/CHANGELOG.rst) - [Commits](https://github.com/pypa/packaging/compare/24.1...26.2) --- updated-dependencies: - dependency-name: packaging dependency-version: '26.2' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-28 08:44:53 +02:00
dependabot[bot]	eff4be6794	chore(deps): bump github.com/onsi/ginkgo/v2 from 2.28.1 to 2.28.2 (#9593 ) Bumps [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) from 2.28.1 to 2.28.2. - [Release notes](https://github.com/onsi/ginkgo/releases) - [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md) - [Commits](https://github.com/onsi/ginkgo/compare/v2.28.1...v2.28.2) --- updated-dependencies: - dependency-name: github.com/onsi/ginkgo/v2 dependency-version: 2.28.2 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-28 08:44:41 +02:00
dependabot[bot]	f1ec30d646	chore(deps): bump github.com/testcontainers/testcontainers-go/modules/postgres from 0.41.0 to 0.42.0 (#9591 ) chore(deps): bump github.com/testcontainers/testcontainers-go/modules/postgres Bumps [github.com/testcontainers/testcontainers-go/modules/postgres](https://github.com/testcontainers/testcontainers-go) from 0.41.0 to 0.42.0. - [Release notes](https://github.com/testcontainers/testcontainers-go/releases) - [Commits](https://github.com/testcontainers/testcontainers-go/compare/v0.41.0...v0.42.0) --- updated-dependencies: - dependency-name: github.com/testcontainers/testcontainers-go/modules/postgres dependency-version: 0.42.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-28 08:44:27 +02:00
LocalAI [bot]	f3500223d7	chore: ⬆️ Update leejet/stable-diffusion.cpp to `a81677f59c92d90343aebca51dfed7decf0a0cb0` (#9586 ) ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-28 08:44:10 +02:00
LocalAI [bot]	b69bacfcdc	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `d6f3e4e28fbf75e6181e6ea32e734de9ce9304fd` (#9585 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-28 08:43:51 +02:00
LocalAI [bot]	8e50066fa2	chore: ⬆️ Update ggml-org/llama.cpp to `665abc609740d397d30c0d8ef4157dbf900bd1a3` (#9584 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-28 08:43:33 +02:00
Ettore Di Giacinto	a0317d9926	refactor(tests): split app_test.go, move real-backend coverage to e2e-backends core/http/app_test.go had grown to 1495 lines exercising three concerns at once: HTTP-layer integration, real-backend inference (llama-gguf, tts, stablediffusion, transformers embeddings, whisper), and service logic that already has unit-level coverage. Each PR paid for 6 backend builds plus real-model downloads to satisfy a single suite. Reorg per layer: - app_test.go (1495 -> 1003 lines) drives the mock-backend binary only. Kept: auth, routing, gallery API, file:// import, /system, agent-jobs HTTP plumbing, config-file model loading. Deleted real-inference specs (llama-gguf chat, ggml completions/streaming, logprobs, logit_bias, transcription, embeddings, External-gRPC, Stores duplicate, Model gallery Context). Lifted Agent Jobs out of the deleted Stores Context. - tests/e2e-backends/backend_test.go gains logprobs, logit_bias, and no-first-token-dup specs (the latter folded into PredictStream). Two new caps gate them so non-LLM backends opt out. - tests/e2e-aio/e2e_test.go gains a streaming smoke under Context("text") to catch container-level streaming regressions. - tests/models_fixtures/ removed; all fixtures referenced testmodel.ggml. app_test.go now writes per-Context inline mock-model YAMLs. CI: - test.yml + tests-e2e.yml gain paths-ignore (docs/, examples/, *.md, backend/) so docs and backend-only PRs skip them. test.yml drops the 6-backend Build step plus TRANSFORMER_BACKEND/GO_TAGS=tts; tests-apple drops the llama-cpp-darwin build. - New tests-aio.yml runs the AIO container nightly + on workflow_dispatch + master/tags. The tests-e2e-container job moved out of test.yml so PRs no longer pay AIO cost. - New tests-llama-cpp-smoke job in test-extra.yml runs on every PR with no detect-changes gate; pulls quay.io/go-skynet/local-ai-backends: master-cpu-llama-cpp (no build on PR) and exercises predict/stream/ logprobs/logit_bias against Qwen3-0.6B. This is the PR-acceptance real-backend gate after AIO moved to nightly. The path-gated heavy test-extra-backend-llama-cpp wrapper appends the same caps so it exercises the moved specs when the backend actually changes. Makefile: - Deleted test-models/testmodel.ggml (the wget chain), test-llama-gguf, test-tts, test-stablediffusion, test-realtime-models. test target drops --label-filter, HUGGINGFACE_GRPC, TRANSFORMER_BACKEND, TEST_DIR, FIXTURES, CONFIG_FILE, MODELS_PATH, BACKENDS_PATH; depends on build-mock-backend. test-stores keeps a focused entry point and depends on backends/local-store. clean-tests also clears the mock-backend binary. Net per typical Go-side PR: ~25min (6 backend builds + tests + AIO) + ~8min e2e drops to ~5min mock-backend test + ~8min e2e + ~5-10min llama-cpp-smoke (image pulled). Docs and backend-only PRs skip the always-on workflows entirely. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:claude-opus-4-7 [Edit] [Write] [Bash]	2026-04-27 23:09:20 +00:00
Ettore Di Giacinto	3948b580d2	fix(distributed): worker stopBackend/isRunning resolve bare modelID to replica keys PR #9583 changed the supervisor's process map key from `modelID` to `modelID#replicaIndex`, but the NATS lifecycle handlers kept passing the bare modelID: * `backend.stop` (subscribeLifecycleEvents): `s.stopBackend(req.Backend)` → `s.processes["Qwen3.6-..."]` missed (actual key is "...#0") → silent no-op. Admin "Unload model" clicks released VRAM via model.unload but left the gRPC process alive on its old port. Subsequent chats hit installBackend, found the leftover process, reused its address — and the UI reported "no models loaded" while the model kept responding. * `backend.delete` (subscribeLifecycleEvents): same map miss in `isRunning(req.Backend)` and `s.stopBackend(req.Backend)` — admin "Delete backend" deleted the binary while the process was still serving traffic. Add `resolveProcessKeys(id)`: exact match if `id` is a full processKey (stopAllBackends iterates the map and passes its own keys); prefix-match if `id` is bare (NATS handlers); empty if `id` contains `#` but doesn't match (no spurious fallback when the caller was explicit). stopBackend and isRunning now call it; stopBackend gets a new stopBackendExact helper for per-key cleanup. TDD: regression test fails without the fix (resolveProcessKeys doesn't exist; map lookup by bare name returns nothing). Tests pass post-fix. Reproduced live: registry row count was 0 for the model the user "Unloaded", chat still served by the leftover worker process. SmartRouter behavior is correct in itself — it falls through to scheduleAndLoad when no row exists; the bug was that the leftover process corrupted the install path. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [Bash]	2026-04-27 21:43:15 +00:00
LocalAI [bot]	5efbe8405f	feat(swagger): update swagger (#9587 ) Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-27 23:28:03 +02:00
Ettore Di Giacinto	ea1df8945b	fix(distributed): preserve UI-added node labels across worker re-register The register endpoint called SetNodeLabels(req.Labels) — replace-all semantics — so every worker re-register wiped every label not in the worker's body. The bug existed since labels were introduced in PR #9186 (Mar 31), but only triggered for workers that supplied labels via --node-labels. PR #9583 (the multi-replica refactor) added an auto-mirrored `node.replica-slots` label to every worker's registration body, which made `len(req.Labels) > 0` always true — turning a latent edge-case bug into a universal one. Operators reported "labels assigned to node do not persist": labels survived until the next worker restart, then disappeared. Fix: iterate req.Labels and call SetNodeLabel (upsert) for each instead of SetNodeLabels (delete-then-recreate). Worker-managed labels still refresh on re-register; UI-added labels survive. Trade-off: an operator who removes a label from --node-labels won't have it auto-removed from the DB on next register — they can clean it via the UI. Acceptable, since the alternative (current behavior) silently destroys operator state. Regression test added first (TDD): RegisterNodeEndpoint registers a node, the test simulates a UI add via SetNodeLabel, then re-registers with a different worker label set; assertion that the UI-added label survives. Test fails against the broken code, passes against the fix. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [Bash]	2026-04-27 21:24:50 +00:00
Ettore Di Giacinto	3280b9a287	fix(distributed): per-replica backend logs (store aggregation + UI) The multi-replica refactor (PR #9583) changed the worker's process key from `modelID` to `modelID#replicaIndex`, but the BackendLogStore kept the bare-modelID lookup. Result: every distributed deployment lost backend logs in the Nodes UI — single-replica too, since even the default capacity of 1 produces a `#0` suffix. Two changes wired together: * pkg/model: BackendLogStore.GetLines/Subscribe now treat a modelID without `#` as a model prefix and merge across all `modelID#N` replica buffers (timestamp-sorted for GetLines; fan-in for Subscribe). Calls with a full `modelID#N` key resolve exactly. ListModels strips replica suffixes and deduplicates so the listing surfaces one entry per loaded model. * react-ui: per-replica log streams as the default. Loaded Models table disambiguates each row with a `rep N` pill (only when the node hosts >1 replica of a model). Each row's "View logs" link routes to the per-replica process key so operators see only that replica's output. The logs page renders the replica context as a chip in the title and surfaces a segmented control — `Replica 0 / 1 / … / All merged` — when the model has multiple replicas; the merged segment uses the bare-modelID URL (delegating to the store's prefix aggregation) for the side-by-side comparison case. Single-replica deployments see no extra UI. Tests added first (TDD): the regression set in backend_log_store_test.go reproduces the bug at the exact failure point — GetLines/ListModels/Subscribe assertions all fail against the broken code, all pass against the fix. TestSubscribe_PerReplicaFilter pins the exact-key path so a future change can't silently break it. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [Skill:critique] [Skill:audit] [Skill:polish] [Skill:distill]	2026-04-27 20:55:24 +00:00
Ettore Di Giacinto	375bf1929d	fix(ui): hide meta-dev backends in System → Backends Development toggle The Manage view's flagsFor() short-circuited on b.IsMeta and returned dev=false for every meta backend, so meta-dev entries (e.g. llama-cpp-development, whisper-development, insightface-development) leaked through the Development toggle in distributed mode and stayed visible whether the toggle was on or off. The count chip even under-reported because those rows were excluded from it. Drop the IsMeta short-circuit and trust gallery enrichment for both flags. Production metas (llama-cpp) are tagged isAlias=false / isDevelopment=false in the gallery so they still pass both toggles; meta-dev entries carry isDevelopment=true and now correctly hide alongside concrete dev variants. Assisted-by: Claude:claude-opus-4-7 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-27 20:38:20 +00:00
Ettore Di Giacinto	9a7f5e68bd	ci(darwin): add native caches to backend_build_darwin macOS runners can't use the registry-backed BuildKit cache (no Docker daemon), so every darwin matrix run was paying full cost for brew installs, Go module downloads, llama.cpp recompiles and Python wheel resolution. Wires actions/cache@v4 into the reusable workflow for four caches: - Go modules + build cache (setup-go cache: true), shared across matrix - Homebrew downloads + selected /opt/homebrew/Cellar entries, with HOMEBREW_NO_AUTO_UPDATE so restored Cellar paths stay stable - ccache for the llama-cpp CMake variants, keyed on the pinned LLAMA_VERSION; CMAKE_*_COMPILER_LAUNCHER is exported via GITHUB_ENV so backend/cpp/llama-cpp/Makefile picks it up without script changes - Python uv + pip wheel cache, keyed by backend + ISO week — same one-cold-rebuild-per-week cadence as the Linux DEPS_REFRESH Read/write semantics match the existing BuildKit policy: every run restores, only master/tag pushes save, so PRs can't pollute master's warm cache. Documents the new caches and the macOS-specific constraints in .agents/ci-caching.md. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7[1m] [Claude Code]	2026-04-27 20:17:36 +00:00
Ettore Di Giacinto	6b63b47f61	feat(distributed): support multiple replicas of one model on the same node (#9583 ) * feat(distributed): support multiple replicas of one model on the same node The distributed scheduler implicitly assumed `(node_id, model_name)` was unique, but the schema didn't enforce it and the worker keyed all gRPC processes by model name alone. With `MinReplicas=2` against a single worker, the reconciler "scaled up" every 30s but the registry never advanced past 1 row — the worker re-loaded the model in-place every tick until VRAM fragmented and the gRPC process died. This change introduces multi-replica-per-node as a first-class concept, with capacity-aware scheduling, a circuit breaker, and VRAM soft-reservation. Operators can declare per-node capacity via the worker flag `--max-replicas-per-model` (mirrored as auto-label `node.replica-slots=N`) or override per-node from the UI. * Schema: BackendNode gains MaxReplicasPerModel (default 1) and ReservedVRAM. NodeModel gains ReplicaIndex (composite with node_id + model_name). ModelSchedulingConfig gains UnsatisfiableUntil/Ticks for the reconciler circuit breaker. * Registry: replica_index threaded through SetNodeModel, RemoveNodeModel, IncrementInFlight, DecrementInFlight, TouchNodeModel, GetNodeModel, SetNodeModelLoadInfo and the InFlightTrackingClient. New helpers: CountReplicasOnNode, NextFreeReplicaIndex (with ErrNoFreeSlot), RemoveAllNodeModelReplicas, FindNodesWithFreeSlot, ClusterCapacityForModel, ReserveVRAM/ReleaseVRAM (atomic UPDATE with ErrInsufficientVRAM), and the unsatisfiable-flag CRUD. * Worker: processKey now `<modelID>#<replicaIndex>` so concurrent loads of the same model land on distinct ports. Adds CLI flag --max-replicas-per-model (env LOCALAI_MAX_REPLICAS_PER_MODEL, default 1) and emits the auto-label. * Router: scheduleNewModel filters candidates by free slot, allocates the replica index, and soft-reserves VRAM before installing the backend. evictLRUAndFreeNode now deletes the targeted row by ID instead of all replicas of the model on the node — fixes a latent bug where evicting one replica orphaned its siblings. * Reconciler: caps scale-up at ClusterCapacityForModel so a misconfig (MinReplicas > capacity) doesn't loop forever. After 3 consecutive ticks of capacity==0 it sets UnsatisfiableUntil for a 5m cooldown and emits a warning. ClearAllUnsatisfiable fires from Register, ApproveNode, SetNodeLabel(s), RemoveNodeLabel and UpdateMaxReplicasPerModel so a new node joining or label changes wake the reconciler immediately. scaleDownIdle removes highest-replica-index first to keep slots compact. * Heartbeat resets reserved_vram to 0 — worker is the source of truth for actual free VRAM; the reservation is only for the in-tick race window between two scheduling decisions. * Probe path (reconciler.probeLoadedModels and health.doCheckAll) now pass the row's replica_index to RemoveNodeModel so an unreachable replica doesn't orphan healthy siblings. * Admin override: PUT /api/nodes/:id/max-replicas-per-model sets a sticky override (preserved across worker re-registration). DELETE clears the override so the worker's flag applies again on next register. Required because Kong defaults the worker flag to 1, so every worker restart would have silently reverted the UI value. * React UI: always-visible slot badge on the node row (muted at default 1, accented when >1); inline editor in the expanded drawer with pencil-to-edit, Save/Cancel, Esc/Enter, "(override)" indicator when the value is admin-set, and a "Reset" button to hand control back to the worker. Soft confirm when shrinking the cap below the count of loaded replicas. Scheduling rules table gets an "Unsatisfiable until HH:MM" status badge surfacing the cooldown. * node.replica-slots filtered out of the labels strip on the row to avoid duplicating the slot badge. 23 new Ginkgo specs (registry, reconciler, inflight, health) cover: multi-replica row independence, RemoveNodeModel of one replica preserving siblings, NextFreeReplicaIndex slot allocation including ErrNoFreeSlot, capacity-gated scale-up with circuit breaker tripping and recovery on Register, scheduleDownIdle ordering, ClusterCapacity math, ReserveVRAM admission gating, Heartbeat reset, override survival across worker re-registration, and ResetMaxReplicasPerModel handing control back. Plus 8 stdlib tests for the worker processKey / CLI / auto-label. Closes the flap reproduced on Qwen3.6-35B against the nvidia-thor worker (single 128 GiB node, MinReplicas=2): the reconciler now caps the scale-up at the cluster's actual capacity instead of looping. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Read] [Edit] [Bash] [Skill:critique] [Skill:audit] [Skill:polish] [Skill:golang-testing] * refactor(react-ui/nodes): tighten capacity editor copy + adopt ActionMenu for row actions * Capacity editor hint trimmed from operator-doc-style ("Sourced from the worker's `--max-replicas-per-model` flag. Changing it here makes it a sticky admin override that survives worker restarts." → "Saved values stick across worker restarts.") and the override-state copy similarly compressed. The full mechanic is no longer needed in the UI — the override pill carries the meaning and the docs cover the rest. * Node row actions migrated from an inline cluster of icon buttons (Drain / Resume / Trash) to the kebab ActionMenu used by /manage for per-row model actions, so dense Nodes tables stay clean. Approve stays as a prominent primary button — it's a stateful admission gate, not a routine action, and elevating it matches how /manage surfaces install-time decisions outside the menu. * The expanded drawer's Labels section now filters node.replica-slots out of the editable label list. The label is owned by the Capacity editor above; surfacing it again as an editable label invited confusion (the Capacity save would clobber any direct edit). Both backend and agent workers benefit — they share the row rendering path, so the action menu and label filter apply to both. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp] [Skill:critique] [Skill:audit] [Skill:polish] * fix(react-ui/nodes): suppress slot badge on agent workers Agent workers don't load models, so the per-node replica capacity is inapplicable to them. Showing "1× slots" on agent rows was a tiny inconsistency from the unified rendering path — gate the badge on node_type !== 'agent' so it only appears on backend workers. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp] * refactor(react-ui/nodes): distill expanded drawer + restyle scheduling form The expanded node drawer used to stack five panels — slot badge, filled capacity box, Loaded Models h4+empty-state, Installed Backends h4+empty-state, Labels h4+chips+form — making routine inspections feel like a control panel. The scheduling rule form wrapped its mode toggle as two 50%-width filled buttons that competed visually with the actual primary action. * Drawer: collapse three rarely-touched config zones (Capacity, Backends, Labels) into one `<details>` "Manage" disclosure (closed by default) with small uppercase eyebrow labels for each zone instead of parallel h4 sub-headings. Loaded Models stays as the at-a-glance headline with a single-line empty hint instead of a boxed empty state. CapacityEditor renders flat (no filled background) — the Manage disclosure provides framing. * Scheduling form: replace the chunky 50%-width button-tabs with the project's existing `.segmented` control (icon + label, sized to content). Mode hint becomes a single tied line below. Fields stack vertically with helper text under inputs and a hairline divider above the right-aligned Save / Cancel. The empty drawer collapses from ~5 stacked sections (~280px tall) to two lines (~80px). The scheduling form now reads as a designed dialog instead of raw building blocks. Both surfaces now match the typographic density and weight of the rest of the admin pages. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp] [Skill:distill] [Skill:audit] [Skill:polish] * feat(react-ui/nodes): replace scheduling form's model picker with searchable combobox The native <select> made operators scroll through every gallery entry to find a model name. The project already has SearchableModelSelect (used in Studio/Talk/etc.) which combines free-text search with the gallery list and accepts typed model names that aren't installed yet — useful for pre-staging a scheduling rule before the node it'll run on has finished bootstrapping. Also drops the now-unused useModels import (the combobox manages the gallery hook internally). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] * refactor(react-ui/nodes): consolidate key/value chip editor + add replica preset chips The Nodes page was rendering the same key=value chip pattern in two places with subtly different markup: the Labels editor in the expanded drawer and (post-distill) the Node Selector input in the scheduling form. The form's input was also a comma-separated string that operators were getting wrong. * Extract <KeyValueChips> as a fully controlled chip-builder. Parent owns the map and decides what onAdd/onRemove does — form state for the scheduling form, API calls for the live drawer Labels editor. Same visuals everywhere; one component to change when polish needs apply. * Replace the comma-separated Node Selector text input with KeyValueChips. Operators were copying syntax from docs and missing commas; the chip vocabulary makes the key=value structure self-documenting. * Add <ReplicaInput>: numeric input + quick-pick preset chips for Min/Max replicas. Picked over a slider because replica counts are exact specs derived from VRAM math (operator decision, not a fuzzy estimate). The chips give one-click access to common values (1/2/3/4 for Min, 0=no-limit/2/4/8 for Max) without the slider's special-value problem (MaxReplicas=0 is categorical, not a position on a continuum). * Drop the now-unused labelInputs state in the Nodes page (the inline label editor's per-node draft state lived there and is now owned by KeyValueChips). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [Skill:distill] * test: fix CI fallout from multi-replica refactor (e2e/distributed + playwright) Two breakages caught by CI that didn't surface in the local run: * tests/e2e/distributed/.go — multiple files used the pre-PR2 registry signatures for SetNodeModel / IncrementInFlight / DecrementInFlight / RemoveNodeModel / TouchNodeModel / GetNodeModel / SetNodeModelLoadInfo and one stale adapter.InstallBackend call in node_lifecycle_test.go. All updated to pass replicaIndex=0 — these tests don't exercise multi-replica behavior, they just need to compile against the new signatures. The chip-builder tests in core/services/nodes/ already cover the multi-replica logic. core/http/react-ui/e2e/nodes-per-node-backend-actions.spec.js — the drawer's distill refactor moved Backends inside a "Manage" <details> disclosure that's collapsed by default. The test helper expanded the node row but never opened Manage, so the per-node backend table was never in the DOM. Helper now clicks `.node-manage > summary` after expanding the row. All 100 playwright tests pass locally; tests/e2e/distributed compiles clean. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:opus-4-7 [Edit] [Bash] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-27 21:20:05 +02:00
Ettore Di Giacinto	f4036fa83f	ci(python-backends): add weekly DEPS_REFRESH cache-buster The shared backend/Dockerfile.python ends in: RUN cd /${BACKEND} && PORTABLE_PYTHON=true make which `pip install`s each backend's requirements*.txt. A scan of all 34 Python backends shows every single one ships at least some unpinned deps (torch, transformers, vllm, diffusers, ...). With the registry cache now enabled, that `make` layer's BuildKit hash depends only on Dockerfile instructions + COPYed source — not on what pip resolves at runtime — so a warm cache would freeze upstream versions indefinitely. DEPS_REFRESH is an ARG declared right before that RUN. backend_build.yml computes `date -u +%Y-W%V` (ISO week, e.g. `2026-W17`) and passes it as a build-arg, so the install layer invalidates at most once per week and re-resolves PyPI / nightly indexes. Within a week, builds stay warm. Only Dockerfile.python is affected: Go (go.sum) and Rust (Cargo.lock) already lock their deps, and the C++ backends pull gRPC at a pinned tag and llama.cpp at a pinned commit. Add .agents/ci-caching.md documenting the cache layout (quay.io/go-skynet/ci-cache:cache<tag-suffix>), read/write semantics (master writes, PRs read-only), DEPS_REFRESH semantics, and how to manually evict tags. Index it from AGENTS.md (CLAUDE.md is a symlink). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:claude-opus-4-7-1m	2026-04-27 14:21:11 +00:00
Ettore Di Giacinto	3810fe1a1e	fix(distributed): worker container healthcheck always unhealthy The Dockerfile's HEALTHCHECK probes http://localhost:8080/readyz, which is the OpenAI API server port. When the same image runs as a worker, it listens on the gRPC base port (50051) and an HTTP file transfer server on port-1 (50050) — nothing on 8080 — so docker always reports the container as unhealthy. Add unauthenticated /readyz and /healthz endpoints to the worker's HTTP file transfer server, and override HEALTHCHECK_ENDPOINT for worker-1 in the distributed compose file. Disable the healthcheck for agent-worker since it is NATS-only and exposes no HTTP server. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-27 13:51:57 +00:00
Ettore Di Giacinto	bdfa5e934a	ci: switch image/backend build cache to a dedicated registry image - Switch cache-from/cache-to in backend_build.yml and image_build.yml from the unused gha cache to type=registry pointing at quay.io/go-skynet/ci-cache:cache<tag-suffix>, mode=max with ignore-error=true. Master/tag builds populate their own per-matrix-entry cache; PR builds read-only. - Drop the broken generate_grpc_cache.yaml cron. It targeted a `grpc` Dockerfile stage that was removed by `b1fc5acd` in July 2025, has been failing every night since, and never populated the gha cache. The new registry-cache scheme is self-warming, so no separate populator is needed. - Remove the dead GRPC_VERSION / GRPC_BASE_IMAGE / GRPC_MAKEFLAGS build-args from image_build.yml and the orphan ARG GRPC_BASE_IMAGE in the root Dockerfile (the root Dockerfile no longer compiles gRPC; the source build now lives in backend/Dockerfile.{llama-cpp, ik-llama-cpp, turboquant} only and uses its own ARG defaults). - Drop the unused grpc-base-image input from image_build.yml plus the matrix passthroughs in image.yml / image-pr.yml. - Drop the unused GRPC_VERSION env in test.yml. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: claude-code:claude-opus-4-7-1m	2026-04-27 13:13:04 +00:00
Richard Palethorpe	deca6dbdad	feat: Log backend exit code (#9581 ) Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-04-27 14:19:18 +02:00
Ettore Di Giacinto	60549a8a60	feat(react-ui): page-width archetype system + mobile/tablet nav polish Replace the universal max-width:1200px cap on .page with a four-tier archetype system (narrow 760, medium 1080, default 1600, wide unbounded) selected per page based on what its UX actually wants. Data/table pages fill ultrawide displays; forms cap at reading width; tabbed feature surfaces breathe. Mobile/tablet: - New 640/1024 breakpoint split. Tablets (640-1023) get a persistent 52px icon rail; below 640 keeps the slide-off drawer. - Drawer polish: body-scroll lock, Escape to close, focus moves into the drawer on open and back to the hamburger on close, aria-hidden + inert on main while open. - Mobile top bar carries hamburger + theme toggle + account avatar (44x44 touch targets) so theme/account aren't trapped in the drawer. - Page-level reflow on phones: page-header column-stacks, filter chips scroll horizontally, tables go edge-to-edge, OperationsBar overflows rather than wrapping. Honors prefers-reduced-motion. Manage > Models: drop the toggle column; Enable/Disable joins the per-row Actions menu alongside Stop/Pin/Edit/Logs/Delete for consistency with the other action verbs. Page-width tokens live in theme.css so future tuning is one line. Removes 7 inline maxWidth workarounds from page roots. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude Code:claude-opus-4-7 [Edit] [Bash]	2026-04-27 11:51:29 +00:00
Ettore Di Giacinto	54728e292f	feat(react-ui): split Manage backends toggle into Variants and Development Meta backends are now always shown — they're the entries operators configure against — and two independent toggles govern the noise around them. "Variants" hides platform-specific concrete builds that a meta backend aliases on the host (e.g. llama-cpp-cuda12-12.4). "Development" hides pre-release `-development` builds. Each toggle shows the count of items currently hidden in its category. The legacy `bm` URL flag is honored on read so existing deep-links resolve to the same view they used to. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-27 08:23:53 +00:00
Tai An	86fd62233f	fix(gallery): correct Qwen3.5 typo in qwen3.5-27b-claude-4.6 model override (closes #9362 ) (#9580 ) The overrides.parameters.model field referenced 'Qwen3.-27B-Claude-...' (missing the '5'), so model loads failed because the configured filename did not match the file actually downloaded by the entry's files: list ('Qwen3.5-27B-Claude-...'). Aligns the override filename with the files: entries and with the upstream HF repo (mradermacher/Qwen3.5-27B-...).	2026-04-27 09:24:00 +02:00
Alex Brick	41ed8ced70	[intel GPU support] Use latest oneapi-basekit image for Intel images to support b70 (in more places this time) (#9578 ) Update additional intel base images	2026-04-27 09:18:57 +02:00
LocalAI [bot]	05e94bd9e7	chore: ⬆️ Update ggml-org/llama.cpp to `f53577432541bb9edc1588c4ef45c66bf07e4468` (#9577 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-27 08:57:24 +02:00
Ettore Di Giacinto	8d124d080f	feat(gallery): add whisper-development umbrella stanza Mirrors the whisper capabilities map with -development variants so clients can pull the master-tagged whisper.cpp backend via a single platform-resolved name, matching the existing faster-whisper-development and whisperx-development entries. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-26 23:04:27 +00:00
Ettore Di Giacinto	2da1a4d230	feat(distributed): per-node backend installation from the gallery In distributed mode the Backends gallery used to fan every install out to every worker — fine for auto-resolving (meta) backends like llama-cpp where each node picks its own variant, but wrong for hardware-specific builds like cpu-llama-cpp that would silently land on every GPU node. Adds a node-targeted install path through the existing POST /api/nodes/:id/backends/install plumbing, with two entry points: - Backends gallery row gets a split-button in distributed mode. Auto- resolving keeps "Install on all nodes" as the primary; chevron menu opens the picker. Hardware-specific routes the primary directly to the picker — no fan-out path on the row. - Nodes-page drawer gets a "+ Add backend" button that navigates to /app/backends?target=<node-id>; the gallery scopes itself to that node (banner, single per-row install button, Reinstall/Remove for already- installed). One gallery, two scopes — no second UI to maintain. The picker (new NodeInstallPicker) shows a 3-state suitability column (Compatible / Override / Installed), an auto-expanding variant override disclosure that fires when selected nodes have no working GPU, parallel per-node installs with inline status and Retry-failed-nodes, and a mismatch confirm that names the consequence on the button itself. A 409 fan-out guard on /api/backends/apply protects CLI/Terraform/script users from the same footgun: hardware-specific installs in distributed mode now return code "concrete_backend_requires_target" with a human- readable error and a meta_alternative pointer. The gallery list payload now surfaces capabilities, metaBackendFor and per-row nodes (NodeBackendRef) so the picker and the new Nodes column have everything they need without re-walking the gallery client-side. GODEBUG=netdns=go is set on the compose services because the cgo DNS resolver follows the container's nsswitch.conf to host systemd-resolved (127.0.0.53), unreachable from inside the container; the pure-Go resolver reads /etc/resolv.conf directly and uses Docker's embedded DNS. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude Code:claude-opus-4-7[1m] [Edit] [Bash] [Read] [Write]	2026-04-26 22:05:18 +00:00
Ettore Di Giacinto	988430c850	test(react-ui): drive Manage page Backend logs link via the new kebab menu Manage page row actions moved into ActionMenu in `b336d9c6`, so the inline `<a title="Backend logs">` the e2e specs were asserting on no longer exists. Open the row's kebab and assert against the menuitem. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-7	2026-04-26 20:51:01 +00:00

1 2 3 4 5 ...

6212 Commits