LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-04-17 13:28:31 -04:00

Author	SHA1	Message	Date
dependabot[bot]	efae3fd97b	chore(deps): bump dompurify Bumps the npm_and_yarn group with 1 update in the /core/http/react-ui directory: [dompurify](https://github.com/cure53/DOMPurify). Updates `dompurify` from 3.3.2 to 3.4.0 - [Release notes](https://github.com/cure53/DOMPurify/releases) - [Commits](https://github.com/cure53/DOMPurify/compare/3.3.2...3.4.0) --- updated-dependencies: - dependency-name: dompurify dependency-version: 3.4.0 dependency-type: direct:production dependency-group: npm_and_yarn ... Signed-off-by: dependabot[bot] <support@github.com>	2026-04-16 06:24:27 +00:00
dependabot[bot]	ab326a9c61	chore(deps): bump the npm_and_yarn group across 1 directory with 6 updates (#9373 ) Bumps the npm_and_yarn group with 6 updates in the /core/http/react-ui directory: \| Package \| From \| To \| \| --- \| --- \| --- \| \| [vite](https://github.com/vitejs/vite/tree/HEAD/packages/vite) \| `6.4.1` \| `6.4.2` \| \| [@hono/node-server](https://github.com/honojs/node-server) \| `1.19.11` \| `1.19.14` \| \| [flatted](https://github.com/WebReflection/flatted) \| `3.3.4` \| `3.4.2` \| \| [hono](https://github.com/honojs/hono) \| `4.12.7` \| `4.12.14` \| \| [path-to-regexp](https://github.com/pillarjs/path-to-regexp) \| `8.3.0` \| `8.4.2` \| \| [picomatch](https://github.com/micromatch/picomatch) \| `4.0.3` \| `4.0.4` \| Updates `vite` from 6.4.1 to 6.4.2 - [Release notes](https://github.com/vitejs/vite/releases) - [Changelog](https://github.com/vitejs/vite/blob/v6.4.2/packages/vite/CHANGELOG.md) - [Commits](https://github.com/vitejs/vite/commits/v6.4.2/packages/vite) Updates `@hono/node-server` from 1.19.11 to 1.19.14 - [Release notes](https://github.com/honojs/node-server/releases) - [Commits](https://github.com/honojs/node-server/compare/v1.19.11...v1.19.14) Updates `flatted` from 3.3.4 to 3.4.2 - [Commits](https://github.com/WebReflection/flatted/compare/v3.3.4...v3.4.2) Updates `hono` from 4.12.7 to 4.12.14 - [Release notes](https://github.com/honojs/hono/releases) - [Commits](https://github.com/honojs/hono/compare/v4.12.7...v4.12.14) Updates `path-to-regexp` from 8.3.0 to 8.4.2 - [Release notes](https://github.com/pillarjs/path-to-regexp/releases) - [Changelog](https://github.com/pillarjs/path-to-regexp/blob/master/History.md) - [Commits](https://github.com/pillarjs/path-to-regexp/compare/v8.3.0...v8.4.2) Updates `picomatch` from 4.0.3 to 4.0.4 - [Release notes](https://github.com/micromatch/picomatch/releases) - [Changelog](https://github.com/micromatch/picomatch/blob/master/CHANGELOG.md) - [Commits](https://github.com/micromatch/picomatch/compare/4.0.3...4.0.4) --- updated-dependencies: - dependency-name: vite dependency-version: 6.4.2 dependency-type: direct:development dependency-group: npm_and_yarn - dependency-name: "@hono/node-server" dependency-version: 1.19.14 dependency-type: indirect dependency-group: npm_and_yarn - dependency-name: flatted dependency-version: 3.4.2 dependency-type: indirect dependency-group: npm_and_yarn - dependency-name: hono dependency-version: 4.12.14 dependency-type: indirect dependency-group: npm_and_yarn - dependency-name: path-to-regexp dependency-version: 8.4.2 dependency-type: indirect dependency-group: npm_and_yarn - dependency-name: picomatch dependency-version: 4.0.4 dependency-type: indirect dependency-group: npm_and_yarn ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-16 08:23:03 +02:00
Ettore Di Giacinto	ad3c8c4832	fix(agents): handle embedding model dim changes on collection upload (#9365 ) Bumps LocalAGI to pick up the LocalRecall postgres backend fix that resizes the pgvector column when the configured embedding model returns vectors of a different dimensionality than the existing collection. Switching the agent pool's embedding model now triggers a transparent re-embed at startup instead of failing every subsequent upload with 'expected N dimensions, not M' (SQLSTATE 22000). Also surfaces a 409 with an actionable message in UploadToCollectionEndpoint as a safety net for the rare cases the upstream migration path doesn't cover (e.g. a model swapped at runtime), instead of the previous opaque 500.	2026-04-15 20:05:28 +02:00
Ettore Di Giacinto	410d100cc3	chore(ui): improve visibility of forms, color palette Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-14 21:53:03 +00:00
Ettore Di Giacinto	87e6de1989	feat: wire transcription for llama.cpp, add streaming support (#9353 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-14 16:13:40 +02:00
Ettore Di Giacinto	016da02845	feat: refactor shared helpers and enhance MLX backend functionality (#9335 ) * refactor(backends): extract python_utils + add mlx_utils shared helpers Move parse_options() and messages_to_dicts() out of vllm_utils.py into a new framework-agnostic python_utils.py, and re-export them from vllm_utils so existing vllm / vllm-omni imports keep working. Add mlx_utils.py with split_reasoning() and parse_tool_calls() — ported from mlx_vlm/server.py's process_tool_calls. These work with any mlx-lm / mlx-vlm tool module (anything exposing tool_call_start, tool_call_end, parse_tool_call). Used by the mlx and mlx-vlm backends in later commits to emit structured ChatDelta.tool_calls without reimplementing per-model parsing. Shared smoke tests confirm: - parse_options round-trips bool/int/float/string - vllm_utils re-exports are identity-equal to python_utils originals - mlx_utils parse_tool_calls handles <tool_call>...</tool_call> with a shim module and produces a correctly-indexed list with JSON arguments - mlx_utils split_reasoning extracts <think> blocks and leaves clean content * feat(mlx): wire native tool parsers + ChatDelta + token usage + logprobs Bring the MLX backend up to the same structured-output contract as vLLM and llama.cpp: emit Reply.chat_deltas so the OpenAI HTTP layer sees tool_calls and reasoning_content, not just raw text. Key insight: mlx_lm.load() returns a TokenizerWrapper that already auto- detects the right tool parser from the model's chat template (_infer_tool_parser in mlx_lm/tokenizer_utils.py). The wrapper exposes has_tool_calling, has_thinking, tool_parser, tool_call_start, tool_call_end, think_start, think_end — no user configuration needed, unlike vLLM. Changes in backend/python/mlx/backend.py: - Imports: replace inline parse_options / messages_to_dicts with the shared helpers from python_utils. Pull split_reasoning / parse_tool_calls from the new mlx_utils shared module. - LoadModel: log the auto-detected has_tool_calling / has_thinking / tool_parser_type for observability. Drop the local is_float / is_int duplicates. - _prepare_prompt: run request.Messages through messages_to_dicts so tool_call_id / tool_calls / reasoning_content survive the conversion, and pass tools=json.loads(request.Tools) + enable_thinking=True (when request.Metadata says so) to apply_chat_template. Falls back on TypeError for tokenizers whose template doesn't accept those kwargs. - _build_generation_params: return an additional (logits_params, stop_words) pair. Maps RepetitionPenalty / PresencePenalty / FrequencyPenalty to mlx_lm.sample_utils.make_logits_processors and threads StopPrompts through to post-decode truncation. - New _tool_module_from_tokenizer / _finalize_output / _truncate_at_stop helpers. _finalize_output runs split_reasoning when has_thinking is true and parse_tool_calls (using a SimpleNamespace shim around the wrapper's tool_parser callable) when has_tool_calling is true, then extracts prompt_tokens, generation_tokens and (best-effort) logprobs from the last GenerationResponse chunk. - Predict: use make_logits_processors, accumulate text + last_response, finalize into a structured Reply carrying chat_deltas, prompt_tokens, tokens, logprobs. Early-stops on user stop sequences. - PredictStream: per-chunk Reply still carries raw message bytes for back-compat but now also emits chat_deltas=[ChatDelta(content=delta)]. On loop exit, emit a terminal Reply with structured reasoning_content / tool_calls / token counts / logprobs — so the Go side sees tool calls without needing the regex fallback. - TokenizeString RPC: uses the TokenizerWrapper's encode(); returns length + tokens or FAILED_PRECONDITION if the model isn't loaded. - Free RPC: drops model / tokenizer / lru_cache, runs gc.collect(), calls mx.metal.clear_cache() when available, and best-effort clears torch.cuda as a belt-and-suspenders. * feat(mlx-vlm): mirror MLX parity (tool parsers + ChatDelta + samplers) Same treatment as the MLX backend: emit structured Reply.chat_deltas, tool_calls, reasoning_content, token counts and logprobs, and extend sampling parameter coverage beyond the temp/top_p pair the backend used to handle. - Imports: drop the inline is_float/is_int helpers, pull parse_options / messages_to_dicts from python_utils and split_reasoning / parse_tool_calls from mlx_utils. Also import make_sampler and make_logits_processors from mlx_lm.sample_utils — mlx-vlm re-uses them. - LoadModel: use parse_options; call mlx_vlm.tool_parsers._infer_tool_parser / load_tool_module to auto-detect a tool module from the processor's chat_template. Stash think_start / think_end / has_thinking so later finalisation can split reasoning blocks without duck-typing on each call. Logs the detected parser type. - _prepare_prompt: convert proto Messages via messages_to_dicts (so tool_call_id / tool_calls survive), pass tools=json.loads(request.Tools) and enable_thinking=True to apply_chat_template when present, fall back on TypeError for older mlx-vlm versions. Also handle the prompt-only + media and empty-prompt + media paths consistently. - _build_generation_params: return (max_tokens, sampler_params, logits_params, stop_words). Maps repetition_penalty / presence_penalty / frequency_penalty and passes them through make_logits_processors. - _finalize_output / _truncate_at_stop: common helper used by Predict and PredictStream to split reasoning, run parse_tool_calls against the auto-detected tool module, build ToolCallDelta list, and extract token counts + logprobs from the last GenerationResult. - Predict / PredictStream: switch from mlx_vlm.generate to mlx_vlm.stream_generate in both paths, accumulate text + last_response, pass sampler and logits_processors through, emit content-only ChatDelta per streaming chunk followed by a terminal Reply carrying reasoning_content, tool_calls, prompt_tokens, tokens and logprobs. Non-streaming Predict returns the same structured Reply shape. - New helper _collect_media extracted from the duplicated base64 image / audio decode loop. - New TokenizeString RPC using the processor's tokenizer.encode and Free RPC that drops model/processor/config, runs gc + Metal cache clear + best-effort torch.cuda cache clear. * feat(importer/mlx): auto-set tool_parser/reasoning_parser on import Mirror what core/gallery/importers/vllm.go does: after applying the shared inference defaults, look up the model URI in parser_defaults.json and append matching tool_parser:/reasoning_parser: entries to Options. The MLX backends auto-detect tool parsers from the chat template at runtime so they don't actually consume these options — but surfacing them in the generated YAML: - keeps the import experience consistent with vllm - gives users a single visible place to override - documents the intended parser for a given model family * test(mlx): add helper unit tests + TokenizeString/Free + e2e make targets - backend/python/mlx/test.py: add TestSharedHelpers with server-less unit tests for parse_options, messages_to_dicts, split_reasoning and parse_tool_calls (using a SimpleNamespace shim to fake a tool module without requiring a model). Plus test_tokenize_string and test_free RPC tests that load a tiny MLX-quantized Llama and exercise the new RPCs end-to-end. - backend/python/mlx-vlm/test.py: same helper unit tests + cleanup of the duplicated import block at the top of the file. - Makefile: register BACKEND_MLX and BACKEND_MLX_VLM (they were missing from the docker-build-target eval list — only mlx-distributed had a generated target before). Add test-extra-backend-mlx and test-extra-backend-mlx-vlm convenience targets that build the respective image and run tests/e2e-backends with the tools capability against mlx-community/Qwen2.5-0.5B-Instruct-4bit. The MLX backend auto-detects the tool parser from the chat template so no BACKEND_TEST_OPTIONS is needed (unlike vllm). * fix(libbackend): don't pass --copies to venv unless PORTABLE_PYTHON=true backend/python/common/libbackend.sh:ensureVenv() always invoked 'python -m venv --copies', but macOS system python (and some other builds) refuses with: Error: This build of python cannot create venvs without using symlinks --copies only matters when _makeVenvPortable later relocates the venv, which only happens when PORTABLE_PYTHON=true. Make --copies conditional on that flag and fall back to default (symlinked) venv otherwise. Caught while bringing up the mlx backend on Apple Silicon — the same build path is used by every Python backend with USE_PIP=true. * fix(mlx): support mlx-lm 0.29.x tool calling + drop deprecated clear_cache The released mlx-lm 0.29.x ships a much simpler tool-calling API than HEAD: TokenizerWrapper detects the <tool_call>...</tool_call> markers from the tokenizer vocab and exposes has_tool_calling / tool_call_start / tool_call_end, but does NOT expose a tool_parser callable on the wrapper and does NOT ship a mlx_lm.tool_parsers subpackage at all (those only exist on main). Caught while running the smoke test on Apple Silicon with the released mlx-lm 0.29.1: tokenizer.tool_parser raised AttributeError (falling through to the underlying HF tokenizer), so _tool_module_from_tokenizer always returned None and tool calls slipped through as raw <tool_call>...</tool_call> text in Reply.message instead of being parsed into ChatDelta.tool_calls. Fix: when has_tool_calling is True but tokenizer.tool_parser is missing, default the parse_tool_call callable to json.loads(body.strip()) — that's exactly what mlx_lm.tool_parsers.json_tools.parse_tool_call does on HEAD and covers the only format 0.29 detects (<tool_call>JSON</tool_call>). Future mlx-lm releases that ship more parsers will be picked up automatically via the tokenizer.tool_parser attribute when present. Also tighten the LoadModel logging — the old log line read init_kwargs.get('tool_parser_type') which doesn't exist on 0.29 and showed None even when has_tool_calling was True. Log the actual tool_call_start / tool_call_end markers instead. While here, switch Free()'s Metal cache clear from the deprecated mx.metal.clear_cache to mx.clear_cache (mlx >= 0.30), with a fallback for older releases. Mirrored to the mlx-vlm backend. * feat(mlx-distributed): mirror MLX parity (tool calls + ChatDelta + sampler) Same treatment as the mlx and mlx-vlm backends: emit Reply.chat_deltas with structured tool_calls / reasoning_content / token counts / logprobs, expand sampling parameter coverage beyond temp+top_p, and add the missing TokenizeString and Free RPCs. Notes specific to mlx-distributed: - Rank 0 is the only rank that owns a sampler — workers participate in the pipeline-parallel forward pass via mx.distributed and don't re-implement sampling. So the new logits_params (repetition_penalty, presence_penalty, frequency_penalty) and stop_words apply on rank 0 only; we don't need to extend coordinator.broadcast_generation_params, which still ships only max_tokens / temperature / top_p to workers (everything else is a rank-0 concern). - Free() now broadcasts CMD_SHUTDOWN to workers when a coordinator is active, so they release the model on their end too. The constant is already defined and handled by the existing worker loop in backend.py:633 (CMD_SHUTDOWN = -1). - Drop the locally-defined is_float / is_int / parse_options trio in favor of python_utils.parse_options, re-exported under the module name for back-compat with anything that imported it directly. - _prepare_prompt: route through messages_to_dicts so tool_call_id / tool_calls / reasoning_content survive, pass tools=json.loads( request.Tools) and enable_thinking=True to apply_chat_template, fall back on TypeError for templates that don't accept those kwargs. - New _tool_module_from_tokenizer (with the json.loads fallback for mlx-lm 0.29.x), _finalize_output, _truncate_at_stop helpers — same contract as the mlx backend. - LoadModel logs the auto-detected has_tool_calling / has_thinking / tool_call_start / tool_call_end so users can see what the wrapper picked up for the loaded model. - backend/python/mlx-distributed/test.py: add the same TestSharedHelpers unit tests (parse_options, messages_to_dicts, split_reasoning, parse_tool_calls) that exist for mlx and mlx-vlm.	2026-04-13 18:44:03 +02:00
Ettore Di Giacinto	d67623230f	feat(vllm): parity with llama.cpp backend (#9328 ) * fix(schema): serialize ToolCallID and Reasoning in Messages.ToProto The ToProto conversion was dropping tool_call_id and reasoning_content even though both proto and Go fields existed, breaking multi-turn tool calling and reasoning passthrough to backends. * refactor(config): introduce backend hook system and migrate llama-cpp defaults Adds RegisterBackendHook/runBackendHooks so each backend can register default-filling functions that run during ModelConfig.SetDefaults(). Migrates the existing GGUF guessing logic into hooks_llamacpp.go, registered for both 'llama-cpp' and the empty backend (auto-detect). Removes the old guesser.go shim. * feat(config): add vLLM parser defaults hook and importer auto-detection Introduces parser_defaults.json mapping model families to vLLM tool_parser/reasoning_parser names, with longest-pattern-first matching. The vllmDefaults hook auto-fills tool_parser and reasoning_parser options at load time for known families, while the VLLMImporter writes the same values into generated YAML so users can review and edit them. Adds tests covering MatchParserDefaults, hook registration via SetDefaults, and the user-override behavior. * feat(vllm): wire native tool/reasoning parsers + chat deltas + logprobs - Use vLLM's ToolParserManager/ReasoningParserManager to extract structured output (tool calls, reasoning content) instead of reimplementing parsing - Convert proto Messages to dicts and pass tools to apply_chat_template - Emit ChatDelta with content/reasoning_content/tool_calls in Reply - Extract prompt_tokens, completion_tokens, and logprobs from output - Replace boolean GuidedDecoding with proper GuidedDecodingParams from Grammar - Add TokenizeString and Free RPC methods - Fix missing `time` import used by load_video() * feat(vllm): CPU support + shared utils + vllm-omni feature parity - Split vllm install per acceleration: move generic `vllm` out of requirements-after.txt into per-profile after files (cublas12, hipblas, intel) and add CPU wheel URL for cpu-after.txt - requirements-cpu.txt now pulls torch==2.7.0+cpu from PyTorch CPU index - backend/index.yaml: register cpu-vllm / cpu-vllm-development variants - New backend/python/common/vllm_utils.py: shared parse_options, messages_to_dicts, setup_parsers helpers (used by both vllm backends) - vllm-omni: replace hardcoded chat template with tokenizer.apply_chat_template, wire native parsers via shared utils, emit ChatDelta with token counts, add TokenizeString and Free RPCs, detect CPU and set VLLM_TARGET_DEVICE - Add test_cpu_inference.py: standalone script to validate CPU build with a small model (Qwen2.5-0.5B-Instruct) * fix(vllm): CPU build compatibility with vllm 0.14.1 Validated end-to-end on CPU with Qwen2.5-0.5B-Instruct (LoadModel, Predict, TokenizeString, Free all working). - requirements-cpu-after.txt: pin vllm to 0.14.1+cpu (pre-built wheel from GitHub releases) for x86_64 and aarch64. vllm 0.14.1 is the newest CPU wheel whose torch dependency resolves against published PyTorch builds (torch==2.9.1+cpu). Later vllm CPU wheels currently require torch==2.10.0+cpu which is only available on the PyTorch test channel with incompatible torchvision. - requirements-cpu.txt: bump torch to 2.9.1+cpu, add torchvision/torchaudio so uv resolves them consistently from the PyTorch CPU index. - install.sh: add --index-strategy=unsafe-best-match for CPU builds so uv can mix the PyTorch index and PyPI for transitive deps (matches the existing intel profile behaviour). - backend.py LoadModel: vllm >= 0.14 removed AsyncLLMEngine.get_model_config so the old code path errored out with AttributeError on model load. Switch to the new get_tokenizer()/tokenizer accessor with a fallback to building the tokenizer directly from request.Model. * fix(vllm): tool parser constructor compat + e2e tool calling test Concrete vLLM tool parsers override the abstract base's __init__ and drop the tools kwarg (e.g. Hermes2ProToolParser only takes tokenizer). Instantiating with tools= raised TypeError which was silently caught, leaving chat_deltas.tool_calls empty. Retry the constructor without the tools kwarg on TypeError — tools aren't required by these parsers since extract_tool_calls finds tool syntax in the raw model output directly. Validated with Qwen/Qwen2.5-0.5B-Instruct + hermes parser on CPU: the backend correctly returns ToolCallDelta{name='get_weather', arguments='{"location": "Paris, France"}'} in ChatDelta. test_tool_calls.py is a standalone smoke test that spawns the gRPC backend, sends a chat completion with tools, and asserts the response contains a structured tool call. * ci(backend): build cpu-vllm container image Add the cpu-vllm variant to the backend container build matrix so the image registered in backend/index.yaml (cpu-vllm / cpu-vllm-development) is actually produced by CI. Follows the same pattern as the other CPU python backends (cpu-diffusers, cpu-chatterbox, etc.) with build-type='' and no CUDA. backend_pr.yml auto-picks this up via its matrix filter from backend.yml. * test(e2e-backends): add tools capability + HF model name support Extends tests/e2e-backends to cover backends that: - Resolve HuggingFace model ids natively (vllm, vllm-omni) instead of loading a local file: BACKEND_TEST_MODEL_NAME is passed verbatim as ModelOptions.Model with no download/ModelFile. - Parse tool calls into ChatDelta.tool_calls: new "tools" capability sends a Predict with a get_weather function definition and asserts the Reply contains a matching ToolCallDelta. Uses UseTokenizerTemplate with OpenAI-style Messages so the backend can wire tools into the model's chat template. - Need backend-specific Options[]: BACKEND_TEST_OPTIONS lets a test set e.g. "tool_parser:hermes,reasoning_parser:qwen3" at LoadModel time. Adds make target test-extra-backend-vllm that: - docker-build-vllm - loads Qwen/Qwen2.5-0.5B-Instruct - runs health,load,predict,stream,tools with tool_parser:hermes Drops backend/python/vllm/test_{cpu_inference,tool_calls}.py — those standalone scripts were scaffolding used while bringing up the Python backend; the e2e-backends harness now covers the same ground uniformly alongside llama-cpp and ik-llama-cpp. * ci(test-extra): run vllm e2e tests on CPU Adds tests-vllm-grpc to the test-extra workflow, mirroring the llama-cpp and ik-llama-cpp gRPC jobs. Triggers when files under backend/python/vllm/ change (or on run-all), builds the local-ai vllm container image, and runs the tests/e2e-backends harness with BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct, tool_parser:hermes, and the tools capability enabled. Uses ubuntu-latest (no GPU) — vllm runs on CPU via the cpu-vllm wheel we pinned in requirements-cpu-after.txt. Frees disk space before the build since the docker image + torch + vllm wheel is sizeable. * fix(vllm): build from source on CI to avoid SIGILL on prebuilt wheel The prebuilt vllm 0.14.1+cpu wheel from GitHub releases is compiled with SIMD instructions (AVX-512 VNNI/BF16 or AMX-BF16) that not every CPU supports. GitHub Actions ubuntu-latest runners SIGILL when vllm spawns the model_executor.models.registry subprocess for introspection, so LoadModel never reaches the actual inference path. - install.sh: when FROM_SOURCE=true on a CPU build, temporarily hide requirements-cpu-after.txt so installRequirements installs the base deps + torch CPU without pulling the prebuilt wheel, then clone vllm and compile it with VLLM_TARGET_DEVICE=cpu. The resulting binaries target the host's actual CPU. - backend/Dockerfile.python: accept a FROM_SOURCE build-arg and expose it as an ENV so install.sh sees it during `make`. - Makefile docker-build-backend: forward FROM_SOURCE as --build-arg when set, so backends that need source builds can opt in. - Makefile test-extra-backend-vllm: call docker-build-vllm via a recursive $(MAKE) invocation so FROM_SOURCE flows through. - .github/workflows/test-extra.yml: set FROM_SOURCE=true on the tests-vllm-grpc job. Slower but reliable — the prebuilt wheel only works on hosts that share the build-time SIMD baseline. Answers 'did you test locally?': yes, end-to-end on my local machine with the prebuilt wheel (CPU supports AVX-512 VNNI). The CI runner CPU gap was not covered locally — this commit plugs that gap. * ci(vllm): use bigger-runner instead of source build The prebuilt vllm 0.14.1+cpu wheel requires SIMD instructions (AVX-512 VNNI/BF16) that stock ubuntu-latest GitHub runners don't support — vllm.model_executor.models.registry SIGILLs on import during LoadModel. Source compilation works but takes 30-40 minutes per CI run, which is too slow for an e2e smoke test. Instead, switch tests-vllm-grpc to the bigger-runner self-hosted label (already used by backend.yml for the llama-cpp CUDA build) — that hardware has the required SIMD baseline and the prebuilt wheel runs cleanly. FROM_SOURCE=true is kept as an opt-in escape hatch: - install.sh still has the CPU source-build path for hosts that need it - backend/Dockerfile.python still declares the ARG + ENV - Makefile docker-build-backend still forwards the build-arg when set Default CI path uses the fast prebuilt wheel; source build can be re-enabled by exporting FROM_SOURCE=true in the environment. * ci(vllm): install make + build deps on bigger-runner bigger-runner is a bare self-hosted runner used by backend.yml for docker image builds — it has docker but not the usual ubuntu-latest toolchain. The make-based test target needs make, build-essential (cgo in 'go test'), and curl/unzip (the Makefile protoc target downloads protoc from github releases). protoc-gen-go and protoc-gen-go-grpc come via 'go install' in the install-go-tools target, which setup-go makes possible. * ci(vllm): install libnuma1 + libgomp1 on bigger-runner The vllm 0.14.1+cpu wheel ships a _C C++ extension that dlopens libnuma.so.1 at import time. When the runner host doesn't have it, the extension silently fails to register its torch ops, so EngineCore crashes on init_device with: AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' Also add libgomp1 (OpenMP runtime, used by torch CPU kernels) to be safe on stripped-down runners. * feat(vllm): bundle libnuma/libgomp via package.sh The vllm CPU wheel ships a _C extension that dlopens libnuma.so.1 at import time; torch's CPU kernels in turn use libgomp.so.1 (OpenMP). Without these on the host, vllm._C silently fails to register its torch ops and EngineCore crashes with: AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' Rather than asking every user to install libnuma1/libgomp1 on their host (or every LocalAI base image to ship them), bundle them into the backend image itself — same pattern fish-speech and the GPU libs already use. libbackend.sh adds ${EDIR}/lib to LD_LIBRARY_PATH at run time so the bundled copies are picked up automatically. - backend/python/vllm/package.sh (new): copies libnuma.so.1 and libgomp.so.1 from the builder's multilib paths into ${BACKEND}/lib, preserving soname symlinks. Runs during Dockerfile.python's 'Run backend-specific packaging' step (which already invokes package.sh if present). - backend/Dockerfile.python: install libnuma1 + libgomp1 in the builder stage so package.sh has something to copy (the Ubuntu base image otherwise only has libgomp in the gcc dep chain). - test-extra.yml: drop the workaround that installed these libs on the runner host — with the backend image self-contained, the runner no longer needs them, and the test now exercises the packaging path end-to-end the way a production host would. * ci(vllm): disable tests-vllm-grpc job (heterogeneous runners) Both ubuntu-latest and bigger-runner have inconsistent CPU baselines: some instances support the AVX-512 VNNI/BF16 instructions the prebuilt vllm 0.14.1+cpu wheel was compiled with, others SIGILL on import of vllm.model_executor.models.registry. The libnuma packaging fix doesn't help when the wheel itself can't be loaded. FROM_SOURCE=true compiles vllm against the actual host CPU and works everywhere, but takes 30-50 minutes per run — too slow for a smoke test on every PR. Comment out the job for now. The test itself is intact and passes locally; run it via 'make test-extra-backend-vllm' on a host with the required SIMD baseline. Re-enable when: - we have a self-hosted runner label with guaranteed AVX-512 VNNI/BF16, or - vllm publishes a CPU wheel with a wider baseline, or - we set up a docker layer cache that makes FROM_SOURCE acceptable The detect-changes vllm output, the test harness changes (tests/ e2e-backends + tools cap), the make target (test-extra-backend-vllm), the package.sh and the Dockerfile/install.sh plumbing all stay in place.	2026-04-13 11:00:29 +02:00
Ettore Di Giacinto	2865f0f8d3	feat(ux): backend management enhancement (#9325 ) * feat: add PreferDevelopmentBackends setting, expose isMeta/isDevelopment in API - Add PreferDevelopmentBackends config field, CLI flag, runtime setting - Add IsDevelopment() method to GalleryBackend - Use AvailableBackendsUnfiltered in UI API to show all backends - Expose isMeta, isDevelopment, preferDevelopmentBackends in backend API response * feat: upgrade banner with Upgrade All button, detect pre-existing backends - Add upgrade banner on Backends page showing count and Upgrade All button - Fix upgrade detection for backends installed before version tracking: flag as upgradeable when gallery has a version but installed has none - Fix OCI digest check to flag backends with no stored digest as upgradeable	2026-04-12 00:35:22 +02:00
Ettore Di Giacinto	8ab0744458	feat: backend versioning, upgrade detection and auto-upgrade (#9315 ) * feat: add backend versioning data model foundation Add Version, URI, and Digest fields to BackendMetadata for tracking installed backend versions and enabling upgrade detection. Add Version field to GalleryBackend. Add UpgradeAvailable/AvailableVersion fields to SystemBackend. Implement GetImageDigest() for lightweight OCI digest lookups via remote.Head. Record version, URI, and digest at install time in InstallBackend() and propagate version through meta backends. * feat: add backend upgrade detection and execution logic Add CheckBackendUpgrades() to compare installed backend versions/digests against gallery entries, and UpgradeBackend() to perform atomic upgrades with backup-based rollback on failure. Includes Agent A's data model changes (Version/URI/Digest fields, GetImageDigest). * feat: add AutoUpgradeBackends config and runtime settings Add configuration and runtime settings for backend auto-upgrade: - RuntimeSettings field for dynamic config via API/JSON - ApplicationConfig field, option func, and roundtrip conversion - CLI flag with LOCALAI_AUTO_UPGRADE_BACKENDS env var - Config file watcher support for runtime_settings.json - Tests for ToRuntimeSettings, ApplyRuntimeSettings, and roundtrip * feat(ui): add backend version display and upgrade support - Add upgrade check/trigger API endpoints to config and api module - Backends page: version badge, upgrade indicator, upgrade button - Manage page: version in metadata, context-aware upgrade/reinstall button - Settings page: auto-upgrade backends toggle * feat: add upgrade checker service, API endpoints, and CLI command - UpgradeChecker background service: checks every 6h, auto-upgrades when enabled - API endpoints: GET /backends/upgrades, POST /backends/upgrades/check, POST /backends/upgrade/:name - CLI: `localai backends upgrade` command, version display in `backends list` - BackendManager interface: add UpgradeBackend and CheckUpgrades methods - Wire upgrade op through GalleryService backend handler - Distributed mode: fan-out upgrade to worker nodes via NATS * fix: use advisory lock for upgrade checker in distributed mode In distributed mode with multiple frontend instances, use PostgreSQL advisory lock (KeyBackendUpgradeCheck) so only one instance runs periodic upgrade checks and auto-upgrades. Prevents duplicate upgrade operations across replicas. Standalone mode is unchanged (simple ticker loop). * test: add e2e tests for backend upgrade API - Test GET /api/backends/upgrades returns 200 (even with no upgrade checker) - Test POST /api/backends/upgrade/:name accepts request and returns job ID - Test full upgrade flow: trigger upgrade via API, wait for job completion, verify run.sh updated to v2 and metadata.json has version 2.0.0 - Test POST /api/backends/upgrades/check returns 200 - Fix nil check for applicationInstance in upgrade API routes	2026-04-11 22:31:15 +02:00
Ettore Di Giacinto	5c35e85fe2	feat: allow to pin models and skip from reaping (#9309 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-11 08:38:17 +02:00
Leigh Phillips	062e0d0d00	feat: Add toggle mechanism to enable/disable models from loading on demand (#9304 ) * feat: add toggle mechanism to enable/disable models from loading on demand Implements #9303 - Adds ability to disable models from being auto-loaded while keeping them in the collection. Backend changes: - Add Disabled field to ModelConfig struct with IsDisabled() getter - New ToggleModelEndpoint handler (PUT /models/toggle/:name/:action) - Request middleware returns 403 when disabled model is requested - Capabilities endpoint exposes disabled status Frontend changes: - Toggle switch in System > Models table Actions column - Visual indicators: dimmed row, red Disabled badge, muted icons - Tooltip describes toggle function on hover - Loading state while API call is in progress * fix: remove extra closing brace causing syntax error in request middleware * refactor: reorder Actions column - Stop button before toggle switch * refactor: migrate from toggle to toggle-state per PR review feedback	2026-04-10 18:17:41 +02:00
Ettore Di Giacinto	9748a1cbc6	fix(streaming): skip chat deltas for role-init elements to prevent first token duplication (#9299 ) When TASK_RESPONSE_TYPE_OAI_CHAT is used, the first streaming token produces a JSON array with two elements: a role-init chunk and the actual content chunk. The grpc-server loop called attach_chat_deltas for both elements with the same raw_result pointer, stamping the first token's ChatDelta.Content on both replies. The Go side accumulated both, emitting the first content token twice to SSE clients. Fix: in the array iteration loops in PredictStream, detect role-init elements (delta has "role" key) and skip attach_chat_deltas for them. Only content/reasoning elements get chat deltas attached. Reasoning models are unaffected because their first token goes into reasoning_content, not content.	2026-04-10 08:45:47 +02:00
Ettore Di Giacinto	e1a6010874	fix(streaming): deduplicate tool call emissions during streaming (#9292 ) The Go-side incremental JSON parser was emitting the same tool call on every streaming token because it lacked the len > lastEmittedCount guard that the XML parser had. On top of that, the post-streaming default: case re-emitted all tool calls from index 0, duplicating everything. This produced duplicate delta.tool_calls events causing clients to accumulate arguments as "{args}{args}" — invalid JSON. Fixes: - JSON incremental parser: add len(jsonResults) > lastEmittedCount guard and loop from lastEmittedCount (matching the XML parser pattern) - Post-streaming default: case: skip i < lastEmittedCount entries that were already emitted during streaming - JSON parser: use blocking channel send (matching XML parser behavior)	2026-04-10 00:44:25 +02:00
Ettore Di Giacinto	706cf5d43c	feat(sam.cpp): add sam.cpp detection backend (#9288 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-09 21:49:11 +02:00
Ettore Di Giacinto	13a6ed709c	fix: thinking models with tools returning empty content (reasoning-only retry loop) (#9290 ) When clients like Nextcloud or Home Assistant send requests with tools to thinking models (e.g. Gemma 4 with <\|channel>thought tags), the response was empty despite the backend producing valid content. Root cause: the C++ autoparser puts clean content in both the raw Response and ChatDeltas. The Go-side PrependThinkingTokenIfNeeded then prepends the thinking start token to the already-clean content, causing ExtractReasoning to classify the entire response as unclosed reasoning. This made cbRawResult empty, triggering a retry loop that never succeeds. Two fixes: - inference.go: check ChatDeltas for content/tool_calls regardless of whether Response is empty, so skipCallerRetry fires correctly - chat.go: when ChatDeltas have content but no tool calls, use that content directly instead of falling back to the empty cbRawResult	2026-04-09 18:30:31 +02:00
Ettore Di Giacinto	85be4ff03c	feat(api): add ollama compatibility (#9284 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-09 14:15:14 +02:00
Ettore Di Giacinto	e00ce981f0	fix: try to add whisperx and faster-whisper for more variants (#9278 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-08 21:23:38 +02:00
Ettore Di Giacinto	39c6b3ed66	feat: track files being staged (#9275 ) This changeset makes visible when files are being staged, so users are aware that the model "isn't ready yet" for requests. Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-08 14:33:58 +02:00
Ettore Di Giacinto	0e9d1a6588	chore(ci): drop unnecessary test Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-08 12:19:54 +00:00
Ettore Di Giacinto	154fa000d3	fix(autoscaling): extract load model from Route() and use as well when doing autoscale (#9270 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-08 08:27:51 +02:00
Richard Palethorpe	9ac1bdc587	feat(ui): Interactive model config editor with autocomplete (#9149 ) * feat(ui): Add dynamic model editor with autocomplete Signed-off-by: Richard Palethorpe <io@richiejp.com> * chore(docs): Add link to longformat installation video Signed-off-by: Richard Palethorpe <io@richiejp.com> --------- Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-04-07 14:42:23 +02:00
Ettore Di Giacinto	505c417fa7	fix(gpu): better detection for MacOS and Thor (#9263 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-07 00:39:07 +02:00
Ettore Di Giacinto	0f9d516a6c	fix(anthropic): do not emit empty tokens and fix SSE tool calls (#9258 ) This fixes Claude Code compatibility Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-07 00:38:21 +02:00
Ettore Di Giacinto	92f99b1ec3	fix(token): login via legacy api keys (#9249 ) We were not checking against the api keys when db == nil. This commit also cleanups now unused middleware Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-06 21:45:09 +02:00
Ettore Di Giacinto	773489eeb1	fix(chat): do not retry if we had chatdeltas or tooldeltas from backend (#9244 ) * fix(chat): do not retry if we had chatdeltas or tooldeltas from backend Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: use oai compat for llama.cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * fix: apply to non-streaming path too Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * map also other fields Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-06 10:52:23 +02:00
Ettore Di Giacinto	232e324a68	fix(autoparser): correctly pass by logprobs (#9239 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-05 09:39:22 +02:00
Ettore Di Giacinto	53deeb1107	fix(reasoning): suppress partial tag tokens during autoparser warm-up The C++ PEG parser needs a few tokens to identify the reasoning format (e.g. "<\|channel>thought\n" for Gemma 4). During this warm-up, the gRPC layer was sending raw partial tag tokens to Go, which leaked into the reasoning field. - Clear reply.message in gRPC when autoparser is active but has no diffs yet, matching llama.cpp server behavior of only emitting classified output - Prefer C++ autoparser chat deltas for reasoning/content in all streaming paths, falling back to Go-side extraction for backends without autoparser (e.g. vLLM) - Override non-streaming no-tools result with chat delta content when available - Guard PrependThinkingTokenIfNeeded against partial tag prefixes during streaming accumulation - Reorder default thinking tokens so <\|channel>thought is checked before <\|think\|> (Gemma 4 templates contain both)	2026-04-04 20:45:57 +00:00
Ettore Di Giacinto	c5a840f6af	fix(reasoning): warm-up Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-04 20:25:24 +00:00
Ettore Di Giacinto	6d9d77d590	fix(reasoning): accumulate and strip reasoning tags from autoparser results (#9227 ) fix(reasoning): acccumulate and strip reasoning tags from autoparser results Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-04 18:15:32 +02:00
Ettore Di Giacinto	6f304d1201	chore(refactor): use interface (#9226 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-04 17:29:37 +02:00
Richard Palethorpe	557d0f0f04	feat(api): Allow coding agents to interactively discover how to control and configure LocalAI (#9084 ) Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-04-04 15:14:35 +02:00
Ettore Di Giacinto	b7e3589875	fix(anthropic): show null index when not present, default to 0 (#9225 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-04 15:13:17 +02:00
Ettore Di Giacinto	716ddd697b	feat(autoparser): prefer chat deltas from backends when emitted (#9224 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-04 12:12:08 +02:00
Ettore Di Giacinto	223deb908d	fix(nats): improve error handling (#9222 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-04 12:11:54 +02:00
Ettore Di Giacinto	9f8821bba8	feat(gemma4): add thinking support (#9221 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-04 12:11:38 +02:00
Ettore Di Giacinto	84e51b68ef	fix(ui): pass by staticApiKeyRequired to show login when only api key is configured (#9220 ) This fixes #9213 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-04 12:11:22 +02:00
github-actions[bot]	57c0026715	chore: bump inference defaults from unsloth (#9219 ) Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-04 09:44:12 +02:00
Ettore Di Giacinto	6c635e8353	feat: add resume endpoint to undrain nodes (#9197 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-01 18:21:43 +02:00
Ettore Di Giacinto	6b6c136210	fix(inflight): count inflight from load model, but release afterwards (#9194 ) This should fix the count of 1 in flight always showing in the node list Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-31 23:24:45 +02:00
Ettore Di Giacinto	e587ecc485	chore(ui): allow to unload forcefully Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-31 17:20:53 +00:00
Ettore Di Giacinto	221ff0f28f	feat(ui): show cluster status in home in distributed mode Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-31 15:37:58 +00:00
Ettore Di Giacinto	16d5cb00bd	chore: css cleanups Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-31 16:37:38 +02:00
Richard Palethorpe	952635fba6	feat(distributed): Avoid resending models to backend nodes (#9193 ) Signed-off-by: Richard Palethorpe <io@richiejp.com> Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2026-03-31 16:28:13 +02:00
Ettore Di Giacinto	3cc05af2e5	chore(nodes): restore offline nodes too Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-31 14:22:18 +00:00
Richard Palethorpe	efdcbbe332	feat(api): Return 404 when model is not found except for model names in HF format (#9133 ) Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-03-31 10:48:21 +02:00
Ettore Di Giacinto	b4fff9293d	chore: small ui improvements in the node page Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-31 08:41:40 +00:00
Ettore Di Giacinto	3db12eaa7a	fix(oauth/invite): do not register user (prending approval) without correct invite (#9189 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-31 08:29:07 +02:00
Ettore Di Giacinto	8862e3ce60	feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler (#9186 ) * always enable parallel requests Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: move tests to ginkgo Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore(smart router): order by available vram Signed-off-by: Ettore Di Giacinto <mudler@localai.io> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-31 08:28:56 +02:00
Ettore Di Giacinto	dd3376e0a9	chore(workers): improve logging, set header timeouts (#9171 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-03-30 17:26:55 +02:00
Richard Palethorpe	c2f7d1c18b	feat(ui): Add media history to studio pages (e.g. past images) (#9151 ) Signed-off-by: Richard Palethorpe <io@richiejp.com>	2026-03-30 00:49:55 +02:00

1 2 3 4 5 ...

665 Commits