LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-04-16 21:08:16 -04:00

Author	SHA1	Message	Date
LocalAI [bot]	08445b1b89	chore(model-gallery): ⬆️ update checksum (#9369 ) ⬆️ Checksum updates in gallery/index.yaml Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-16 01:12:01 +02:00
Ettore Di Giacinto	ad3c8c4832	fix(agents): handle embedding model dim changes on collection upload (#9365 ) Bumps LocalAGI to pick up the LocalRecall postgres backend fix that resizes the pgvector column when the configured embedding model returns vectors of a different dimensionality than the existing collection. Switching the agent pool's embedding model now triggers a transparent re-embed at startup instead of failing every subsequent upload with 'expected N dimensions, not M' (SQLSTATE 22000). Also surfaces a 409 with an actionable message in UploadToCollectionEndpoint as a safety net for the rare cases the upstream migration path doesn't cover (e.g. a model swapped at runtime), instead of the previous opaque 500.	2026-04-15 20:05:28 +02:00
Ettore Di Giacinto	6f0051301b	feat(backend): add tinygrad multimodal backend (experimental) (#9364 ) * feat(backend): add tinygrad multimodal backend Wire tinygrad as a new Python backend covering LLM text generation with native tool-call extraction, embeddings, Stable Diffusion 1.x image generation, and Whisper speech-to-text from a single self-contained container. Backend (`backend/python/tinygrad/`): - `backend.py` gRPC servicer with LLM Predict/PredictStream (auto-detects Llama / Qwen2 / Mistral architecture from `config.json`, supports safetensors and GGUF), Embedding via mean-pooled last hidden state, GenerateImage via the vendored SD1.x pipeline, AudioTranscription + AudioTranscriptionStream via the vendored Whisper inference loop, plus Tokenize / ModelMetadata / Status / Free. - Vendored upstream model code under `vendor/` (MIT, headers preserved): llama.py with an added `qkv_bias` flag for Qwen2-family bias support and an `embed()` method that returns the last hidden state, plus clip.py, unet.py, stable_diffusion.py (trimmed to drop the MLPerf training branch that pulls `mlperf.initializers`), audio_helpers.py and whisper.py (trimmed to drop the pyaudio listener). - Pluggable tool-call parsers under `tool_parsers/`: hermes (Qwen2.5 / Hermes), llama3_json (Llama 3.1+), qwen3_xml (Qwen 3), mistral (Mistral / Mixtral). Auto-selected from model architecture or `Options`. - `install.sh` pins Python 3.11.14 (tinygrad >=0.12 needs >=3.11; the default portable python is 3.10). - `package.sh` bundles libLLVM.so.1 + libedit/libtinfo/libgomp/libsndfile into the scratch image. `run.sh` sets `CPU_LLVM=1` and `LLVM_PATH` so tinygrad's CPU device uses the in-process libLLVM JIT instead of shelling out to the missing `clang` binary. - Local unit tests for Health and the four parsers in `test.py`. Build wiring: - Root `Makefile`: `.NOTPARALLEL`, `prepare-test-extra`, `test-extra`, `BACKEND_TINYGRAD = tinygrad\|python\|.\|false\|true`, docker-build-target eval, and `docker-build-backends` aggregator. - `.github/workflows/backend.yml`: cpu / cuda12 / cuda13 build matrix entries (mirrors the transformers backend placement). - `backend/index.yaml`: `&tinygrad` meta + cpu/cuda12/cuda13 image entries (latest + development). E2E test wiring: - `tests/e2e-backends/backend_test.go` gains an `image` capability that exercises GenerateImage and asserts a non-empty PNG is written to `dst`. New `BACKEND_TEST_IMAGE_PROMPT` / `BACKEND_TEST_IMAGE_STEPS` knobs. - Five new make targets next to `test-extra-backend-vllm`: - `test-extra-backend-tinygrad` — Qwen2.5-0.5B-Instruct + hermes, mirrors the vllm target 1:1 (5/9 specs in ~57s). - `test-extra-backend-tinygrad-embeddings` — same model, embeddings via LLM hidden state (3/9 in ~10s). - `test-extra-backend-tinygrad-sd` — stable-diffusion-v1-5 mirror, health/load/image (3/9 in ~10min, 4 diffusion steps on CPU). - `test-extra-backend-tinygrad-whisper` — openai/whisper-tiny.en against jfk.wav from whisper.cpp samples (4/9 in ~49s). - `test-extra-backend-tinygrad-all` aggregate. All four targets land green on the first MVP pass: 15 specs total, 0 failures across LLM+tools, embeddings, image generation, and speech transcription. * refactor(tinygrad): collapse to a single backend image tinygrad generates its own GPU kernels (PTX renderer for CUDA, the autogen ctypes wrappers for HIP / Metal / WebGPU) and never links against cuDNN, cuBLAS, or any toolkit-version-tied library. The only runtime dependency that varies across hosts is the driver's libcuda.so.1 / libamdhip64.so, which are injected into the container at run time by the nvidia-container / rocm runtimes. So unlike torch- or vLLM-based backends, there is no reason to ship per-CUDA-version images. - Drop the cuda12-tinygrad and cuda13-tinygrad build-matrix entries from .github/workflows/backend.yml. The sole remaining entry is renamed to -tinygrad (from -cpu-tinygrad) since it is no longer CPU-only. - Collapse backend/index.yaml to a single meta + development pair. The meta anchor carries the latest uri directly; the development entry points at the master tag. - run.sh picks the tinygrad device at launch time by probing /usr/lib/... for libcuda.so.1 / libamdhip64.so. When libcuda is visible we set CUDA=1 + CUDA_PTX=1 so tinygrad uses its own PTX renderer (avoids any nvrtc/toolkit dependency); otherwise we fall back to HIP or CLANG. CPU_LLVM=1 + LLVM_PATH keep the in-process libLLVM JIT for the CLANG path. - backend.py's _select_tinygrad_device() is trimmed to a CLANG-only fallback since production device selection happens in run.sh. Re-ran test-extra-backend-tinygrad after the change: Ran 5 of 9 Specs in 56.541 seconds — 5 Passed, 0 Failed	2026-04-15 19:48:23 +02:00
LocalAI [bot]	8487058673	chore(model-gallery): ⬆️ update checksum (#9358 ) ⬆️ Checksum updates in gallery/index.yaml Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-15 01:25:59 +02:00
LocalAI [bot]	62862ca06b	chore: ⬆️ Update ggml-org/llama.cpp to `fae3a28070fe4026f87bd6a544aba1b2d1896566` (#9357 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-15 01:25:41 +02:00
LocalAI [bot]	07e244d869	feat(swagger): update swagger (#9356 ) Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-15 01:25:24 +02:00
Ettore Di Giacinto	95efb8a562	feat(backend): add turboquant llama.cpp-fork backend (#9355 ) * feat(backend): add turboquant llama.cpp-fork backend turboquant is a llama.cpp fork (TheTom/llama-cpp-turboquant, branch feature/turboquant-kv-cache) that adds a TurboQuant KV-cache scheme. It ships as a first-class backend reusing backend/cpp/llama-cpp sources via a thin wrapper Makefile: each variant target copies ../llama-cpp into a sibling build dir and invokes llama-cpp's build-llama-cpp-grpc-server with LLAMA_REPO/LLAMA_VERSION overridden to point at the fork. No duplication of grpc-server.cpp — upstream fixes flow through automatically. Wires up the full matrix (CPU, CUDA 12/13, L4T, L4T-CUDA13, ROCm, SYCL f32/f16, Vulkan) in backend.yml and the gallery entries in index.yaml, adds a tests-turboquant-grpc e2e job driven by BACKEND_TEST_CACHE_TYPE_K/V=q8_0 to exercise the KV-cache config path (backend_test.go gains dedicated env vars wired into ModelOptions.CacheTypeKey/Value — a generic improvement usable by any llama.cpp-family backend), and registers a nightly auto-bump PR in bump_deps.yaml tracking feature/turboquant-kv-cache. scripts/changed-backends.js gets a special-case so edits to backend/cpp/llama-cpp/ also retrigger the turboquant CI pipeline, since the wrapper reuses those sources. * feat(turboquant): carry upstream patches against fork API drift turboquant branched from llama.cpp before upstream commit 66060008 ("server: respect the ignore eos flag", #21203) which added the `logit_bias_eog` field to `server_context_meta` and a matching parameter to `server_task::params_from_json_cmpl`. The shared backend/cpp/llama-cpp/grpc-server.cpp depends on that field, so building it against the fork unmodified fails. Cherry-pick that commit as a patch file under backend/cpp/turboquant/patches/ and apply it to the cloned fork sources via a new apply-patches.sh hook called from the wrapper Makefile. Simplifies the build flow too: instead of hopping through llama-cpp's build-llama-cpp-grpc-server indirection, the wrapper now drives the copied Makefile directly (clone -> patch -> build). Drop the corresponding patch whenever the fork catches up with upstream — the build fails fast if a patch stops applying, which is the signal to retire it. * docs: add turboquant backend section + clarify cache_type_k/v Document the new turboquant (llama.cpp fork with TurboQuant KV-cache) backend alongside the existing llama-cpp / ik-llama-cpp sections in features/text-generation.md: when to pick it, how to install it from the gallery, and a YAML example showing backend: turboquant together with cache_type_k / cache_type_v. Also expand the cache_type_k / cache_type_v table rows in advanced/model-configuration.md to spell out the accepted llama.cpp quantization values and note that these fields apply to all llama.cpp-family backends, not just vLLM. * feat(turboquant): patch ggml-rpc GGML_OP_COUNT assertion The fork adds new GGML ops bringing GGML_OP_COUNT to 97, but ggml/include/ggml-rpc.h static-asserts it equals 96, breaking the GGML_RPC=ON build paths (turboquant-grpc / turboquant-rpc-server). Carry a one-line patch that updates the expected count so the assertion holds. Drop this patch whenever the fork fixes it upstream. * feat(turboquant): allow turbo* KV-cache types and exercise them in e2e The shared backend/cpp/llama-cpp/grpc-server.cpp carries its own allow-list of accepted KV-cache types (kv_cache_types[]) and rejects anything outside it before the value reaches llama.cpp's parser. That list only contains the standard llama.cpp types — turbo2/turbo3/turbo4 would throw "Unsupported cache type" at LoadModel time, meaning nothing the LocalAI gRPC layer accepted was actually fork-specific. Add a build-time augmentation step (patch-grpc-server.sh, called from the turboquant wrapper Makefile) that inserts GGML_TYPE_TURBO2_0/3_0/4_0 into the allow-list of the copied grpc-server.cpp under turboquant-<flavor>-build/. The original file under backend/cpp/llama-cpp/ is never touched, so the stock llama-cpp build keeps compiling against vanilla upstream which has no notion of those enum values. Switch test-extra-backend-turboquant to set BACKEND_TEST_CACHE_TYPE_K=turbo3 / _V=turbo3 so the e2e gRPC suite actually runs the fork's TurboQuant KV-cache code paths (turbo3 also auto-enables flash_attention in the fork). Picking q8_0 here would only re-test the standard llama.cpp path that the upstream llama-cpp backend already covers. Refresh the docs (text-generation.md + model-configuration.md) to list turbo2/turbo3/turbo4 explicitly and call out that you only get the TurboQuant code path with this backend + a turbo* cache type. * fix(turboquant): rewrite patch-grpc-server.sh in awk, not python3 The builder image (ubuntu:24.04 stage-2 in Dockerfile.turboquant) does not install python3, so the python-based augmentation step errored with `python3: command not found` at make time. Switch to awk, which ships in coreutils and is already available everywhere the rest of the wrapper Makefile runs. * Apply suggestion from @mudler Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> --------- Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2026-04-15 01:25:04 +02:00
Ettore Di Giacinto	410d100cc3	chore(ui): improve visibility of forms, color palette Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-14 21:53:03 +00:00
Ettore Di Giacinto	833b7e8557	chore(docs): update transcription endpoint Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-14 14:14:54 +00:00
Ettore Di Giacinto	87e6de1989	feat: wire transcription for llama.cpp, add streaming support (#9353 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-14 16:13:40 +02:00
Ettore Di Giacinto	b361d2ddd6	chore(gallery): add new llama.cpp supported models (qwen-asr, ocr) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-14 10:04:50 +00:00
Ettore Di Giacinto	1e4c4577bb	fix(ci): small fixups Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-14 09:27:27 +00:00
dependabot[bot]	98fd9d5cc6	chore(deps): bump github.com/charmbracelet/glamour from 0.10.0 to 1.0.0 (#9340 ) Bumps [github.com/charmbracelet/glamour](https://github.com/charmbracelet/glamour) from 0.10.0 to 1.0.0. - [Release notes](https://github.com/charmbracelet/glamour/releases) - [Commits](https://github.com/charmbracelet/glamour/compare/v0.10.0...v1.0.0) --- updated-dependencies: - dependency-name: github.com/charmbracelet/glamour dependency-version: 1.0.0 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-14 11:17:05 +02:00
dependabot[bot]	0c725f5702	chore(deps): bump github.com/swaggo/echo-swagger from 1.4.1 to 1.5.2 (#9344 ) Bumps [github.com/swaggo/echo-swagger](https://github.com/swaggo/echo-swagger) from 1.4.1 to 1.5.2. - [Release notes](https://github.com/swaggo/echo-swagger/releases) - [Commits](https://github.com/swaggo/echo-swagger/compare/v1.4.1...v1.5.2) --- updated-dependencies: - dependency-name: github.com/swaggo/echo-swagger dependency-version: 1.5.2 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-14 11:15:37 +02:00
dependabot[bot]	7661a4ffa5	chore(deps): bump github.com/testcontainers/testcontainers-go/modules/nats from 0.41.0 to 0.42.0 (#9341 ) chore(deps): bump github.com/testcontainers/testcontainers-go/modules/nats Bumps [github.com/testcontainers/testcontainers-go/modules/nats](https://github.com/testcontainers/testcontainers-go) from 0.41.0 to 0.42.0. - [Release notes](https://github.com/testcontainers/testcontainers-go/releases) - [Commits](https://github.com/testcontainers/testcontainers-go/compare/v0.41.0...v0.42.0) --- updated-dependencies: - dependency-name: github.com/testcontainers/testcontainers-go/modules/nats dependency-version: 0.42.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-14 11:15:26 +02:00
dependabot[bot]	24ad6e4be1	chore(deps): bump github.com/google/go-containerregistry from 0.21.3 to 0.21.5 (#9343 ) chore(deps): bump github.com/google/go-containerregistry Bumps [github.com/google/go-containerregistry](https://github.com/google/go-containerregistry) from 0.21.3 to 0.21.5. - [Release notes](https://github.com/google/go-containerregistry/releases) - [Commits](https://github.com/google/go-containerregistry/compare/v0.21.3...v0.21.5) --- updated-dependencies: - dependency-name: github.com/google/go-containerregistry dependency-version: 0.21.5 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-14 11:15:09 +02:00
LocalAI [bot]	c0648b8836	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `55d3c05bf7b377deaa5dc84d255d9740a345a206` (#9348 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-14 08:56:25 +02:00
Ettore Di Giacinto	a05c7def59	fix(e2e): update to new testcontainers Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-14 06:56:04 +00:00
LocalAI [bot]	906acba8db	chore: ⬆️ Update ggml-org/llama.cpp to `e97492369888f5311e4d1f3beb325a36bbed70e9` (#9347 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-14 08:54:25 +02:00
dependabot[bot]	4226ca4aee	chore(deps): bump sentence-transformers from 5.2.3 to 5.4.0 in /backend/python/transformers (#9342 ) chore(deps): bump sentence-transformers in /backend/python/transformers Bumps [sentence-transformers](https://github.com/huggingface/sentence-transformers) from 5.2.3 to 5.4.0. - [Release notes](https://github.com/huggingface/sentence-transformers/releases) - [Commits](https://github.com/huggingface/sentence-transformers/compare/v5.2.3...v5.4.0) --- updated-dependencies: - dependency-name: sentence-transformers dependency-version: 5.4.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-14 00:30:27 +02:00
LocalAI [bot]	c6d5dc3374	chore(model-gallery): ⬆️ update checksum (#9346 ) ⬆️ Checksum updates in gallery/index.yaml Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-13 23:00:13 +02:00
Ettore Di Giacinto	7ce675af21	chore(gallery-agent): extract readme Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-13 20:31:49 +00:00
Ettore Di Giacinto	be1b8d56c9	fix(gallery): override parameters for flux kontext Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2026-04-13 22:29:17 +02:00
dependabot[bot]	97f087ed31	chore(deps): bump github.com/testcontainers/testcontainers-go from 0.41.0 to 0.42.0 (#9338 ) chore(deps): bump github.com/testcontainers/testcontainers-go Bumps [github.com/testcontainers/testcontainers-go](https://github.com/testcontainers/testcontainers-go) from 0.41.0 to 0.42.0. - [Release notes](https://github.com/testcontainers/testcontainers-go/releases) - [Commits](https://github.com/testcontainers/testcontainers-go/compare/v0.41.0...v0.42.0) --- updated-dependencies: - dependency-name: github.com/testcontainers/testcontainers-go dependency-version: 0.42.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-13 21:54:02 +02:00
dependabot[bot]	8691bbe663	chore(deps): bump actions/upload-pages-artifact from 4 to 5 (#9337 ) Bumps [actions/upload-pages-artifact](https://github.com/actions/upload-pages-artifact) from 4 to 5. - [Release notes](https://github.com/actions/upload-pages-artifact/releases) - [Commits](https://github.com/actions/upload-pages-artifact/compare/v4...v5) --- updated-dependencies: - dependency-name: actions/upload-pages-artifact dependency-version: '5' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-13 21:53:47 +02:00
dependabot[bot]	7998f96f11	chore(deps): bump softprops/action-gh-release from 2 to 3 (#9336 ) Bumps [softprops/action-gh-release](https://github.com/softprops/action-gh-release) from 2 to 3. - [Release notes](https://github.com/softprops/action-gh-release/releases) - [Changelog](https://github.com/softprops/action-gh-release/blob/master/CHANGELOG.md) - [Commits](https://github.com/softprops/action-gh-release/compare/v2...v3) --- updated-dependencies: - dependency-name: softprops/action-gh-release dependency-version: '3' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-13 21:53:28 +02:00
Ettore Di Giacinto	cada97ee46	chore(gallery-agent): control bot via PR Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-13 19:52:48 +00:00
Ettore Di Giacinto	3375ea1a2c	chore(gallery-agent): simplify Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-13 19:50:31 +00:00
Ettore Di Giacinto	0e7c0adee4	docs: document tool calling on vLLM and MLX backends openai-functions.md used to claim LocalAI tool calling worked only on llama.cpp-compatible models. That was true when it was written; it's not true now — vLLM (since PR #9328) and MLX/MLX-VLM both extract structured tool calls from model output. - openai-functions.md: new 'Supported backends' matrix covering llama.cpp, vllm, vllm-omni, mlx, mlx-vlm, with the key distinction that vllm needs an explicit tool_parser: option while mlx auto- detects from the chat template. Reasoning content (think tags) is extracted on the same set of backends. Added setup snippets for both the vllm and mlx paths, and noted the gallery importer pre-fills tool_parser:/reasoning_parser: for known families. - compatibility-table.md: fix the stale 'Streaming: no' for vllm, vllm-omni, mlx, mlx-vlm (all four support streaming now). Add 'Functions' to their capabilities. Also widen the MLX Acceleration column to reflect the CPU/CUDA/Jetson L4T backends that already exist in backend/index.yaml — 'Metal' on its own was misleading.	2026-04-13 16:58:55 +00:00
Ettore Di Giacinto	016da02845	feat: refactor shared helpers and enhance MLX backend functionality (#9335 ) * refactor(backends): extract python_utils + add mlx_utils shared helpers Move parse_options() and messages_to_dicts() out of vllm_utils.py into a new framework-agnostic python_utils.py, and re-export them from vllm_utils so existing vllm / vllm-omni imports keep working. Add mlx_utils.py with split_reasoning() and parse_tool_calls() — ported from mlx_vlm/server.py's process_tool_calls. These work with any mlx-lm / mlx-vlm tool module (anything exposing tool_call_start, tool_call_end, parse_tool_call). Used by the mlx and mlx-vlm backends in later commits to emit structured ChatDelta.tool_calls without reimplementing per-model parsing. Shared smoke tests confirm: - parse_options round-trips bool/int/float/string - vllm_utils re-exports are identity-equal to python_utils originals - mlx_utils parse_tool_calls handles <tool_call>...</tool_call> with a shim module and produces a correctly-indexed list with JSON arguments - mlx_utils split_reasoning extracts <think> blocks and leaves clean content * feat(mlx): wire native tool parsers + ChatDelta + token usage + logprobs Bring the MLX backend up to the same structured-output contract as vLLM and llama.cpp: emit Reply.chat_deltas so the OpenAI HTTP layer sees tool_calls and reasoning_content, not just raw text. Key insight: mlx_lm.load() returns a TokenizerWrapper that already auto- detects the right tool parser from the model's chat template (_infer_tool_parser in mlx_lm/tokenizer_utils.py). The wrapper exposes has_tool_calling, has_thinking, tool_parser, tool_call_start, tool_call_end, think_start, think_end — no user configuration needed, unlike vLLM. Changes in backend/python/mlx/backend.py: - Imports: replace inline parse_options / messages_to_dicts with the shared helpers from python_utils. Pull split_reasoning / parse_tool_calls from the new mlx_utils shared module. - LoadModel: log the auto-detected has_tool_calling / has_thinking / tool_parser_type for observability. Drop the local is_float / is_int duplicates. - _prepare_prompt: run request.Messages through messages_to_dicts so tool_call_id / tool_calls / reasoning_content survive the conversion, and pass tools=json.loads(request.Tools) + enable_thinking=True (when request.Metadata says so) to apply_chat_template. Falls back on TypeError for tokenizers whose template doesn't accept those kwargs. - _build_generation_params: return an additional (logits_params, stop_words) pair. Maps RepetitionPenalty / PresencePenalty / FrequencyPenalty to mlx_lm.sample_utils.make_logits_processors and threads StopPrompts through to post-decode truncation. - New _tool_module_from_tokenizer / _finalize_output / _truncate_at_stop helpers. _finalize_output runs split_reasoning when has_thinking is true and parse_tool_calls (using a SimpleNamespace shim around the wrapper's tool_parser callable) when has_tool_calling is true, then extracts prompt_tokens, generation_tokens and (best-effort) logprobs from the last GenerationResponse chunk. - Predict: use make_logits_processors, accumulate text + last_response, finalize into a structured Reply carrying chat_deltas, prompt_tokens, tokens, logprobs. Early-stops on user stop sequences. - PredictStream: per-chunk Reply still carries raw message bytes for back-compat but now also emits chat_deltas=[ChatDelta(content=delta)]. On loop exit, emit a terminal Reply with structured reasoning_content / tool_calls / token counts / logprobs — so the Go side sees tool calls without needing the regex fallback. - TokenizeString RPC: uses the TokenizerWrapper's encode(); returns length + tokens or FAILED_PRECONDITION if the model isn't loaded. - Free RPC: drops model / tokenizer / lru_cache, runs gc.collect(), calls mx.metal.clear_cache() when available, and best-effort clears torch.cuda as a belt-and-suspenders. * feat(mlx-vlm): mirror MLX parity (tool parsers + ChatDelta + samplers) Same treatment as the MLX backend: emit structured Reply.chat_deltas, tool_calls, reasoning_content, token counts and logprobs, and extend sampling parameter coverage beyond the temp/top_p pair the backend used to handle. - Imports: drop the inline is_float/is_int helpers, pull parse_options / messages_to_dicts from python_utils and split_reasoning / parse_tool_calls from mlx_utils. Also import make_sampler and make_logits_processors from mlx_lm.sample_utils — mlx-vlm re-uses them. - LoadModel: use parse_options; call mlx_vlm.tool_parsers._infer_tool_parser / load_tool_module to auto-detect a tool module from the processor's chat_template. Stash think_start / think_end / has_thinking so later finalisation can split reasoning blocks without duck-typing on each call. Logs the detected parser type. - _prepare_prompt: convert proto Messages via messages_to_dicts (so tool_call_id / tool_calls survive), pass tools=json.loads(request.Tools) and enable_thinking=True to apply_chat_template when present, fall back on TypeError for older mlx-vlm versions. Also handle the prompt-only + media and empty-prompt + media paths consistently. - _build_generation_params: return (max_tokens, sampler_params, logits_params, stop_words). Maps repetition_penalty / presence_penalty / frequency_penalty and passes them through make_logits_processors. - _finalize_output / _truncate_at_stop: common helper used by Predict and PredictStream to split reasoning, run parse_tool_calls against the auto-detected tool module, build ToolCallDelta list, and extract token counts + logprobs from the last GenerationResult. - Predict / PredictStream: switch from mlx_vlm.generate to mlx_vlm.stream_generate in both paths, accumulate text + last_response, pass sampler and logits_processors through, emit content-only ChatDelta per streaming chunk followed by a terminal Reply carrying reasoning_content, tool_calls, prompt_tokens, tokens and logprobs. Non-streaming Predict returns the same structured Reply shape. - New helper _collect_media extracted from the duplicated base64 image / audio decode loop. - New TokenizeString RPC using the processor's tokenizer.encode and Free RPC that drops model/processor/config, runs gc + Metal cache clear + best-effort torch.cuda cache clear. * feat(importer/mlx): auto-set tool_parser/reasoning_parser on import Mirror what core/gallery/importers/vllm.go does: after applying the shared inference defaults, look up the model URI in parser_defaults.json and append matching tool_parser:/reasoning_parser: entries to Options. The MLX backends auto-detect tool parsers from the chat template at runtime so they don't actually consume these options — but surfacing them in the generated YAML: - keeps the import experience consistent with vllm - gives users a single visible place to override - documents the intended parser for a given model family * test(mlx): add helper unit tests + TokenizeString/Free + e2e make targets - backend/python/mlx/test.py: add TestSharedHelpers with server-less unit tests for parse_options, messages_to_dicts, split_reasoning and parse_tool_calls (using a SimpleNamespace shim to fake a tool module without requiring a model). Plus test_tokenize_string and test_free RPC tests that load a tiny MLX-quantized Llama and exercise the new RPCs end-to-end. - backend/python/mlx-vlm/test.py: same helper unit tests + cleanup of the duplicated import block at the top of the file. - Makefile: register BACKEND_MLX and BACKEND_MLX_VLM (they were missing from the docker-build-target eval list — only mlx-distributed had a generated target before). Add test-extra-backend-mlx and test-extra-backend-mlx-vlm convenience targets that build the respective image and run tests/e2e-backends with the tools capability against mlx-community/Qwen2.5-0.5B-Instruct-4bit. The MLX backend auto-detects the tool parser from the chat template so no BACKEND_TEST_OPTIONS is needed (unlike vllm). * fix(libbackend): don't pass --copies to venv unless PORTABLE_PYTHON=true backend/python/common/libbackend.sh:ensureVenv() always invoked 'python -m venv --copies', but macOS system python (and some other builds) refuses with: Error: This build of python cannot create venvs without using symlinks --copies only matters when _makeVenvPortable later relocates the venv, which only happens when PORTABLE_PYTHON=true. Make --copies conditional on that flag and fall back to default (symlinked) venv otherwise. Caught while bringing up the mlx backend on Apple Silicon — the same build path is used by every Python backend with USE_PIP=true. * fix(mlx): support mlx-lm 0.29.x tool calling + drop deprecated clear_cache The released mlx-lm 0.29.x ships a much simpler tool-calling API than HEAD: TokenizerWrapper detects the <tool_call>...</tool_call> markers from the tokenizer vocab and exposes has_tool_calling / tool_call_start / tool_call_end, but does NOT expose a tool_parser callable on the wrapper and does NOT ship a mlx_lm.tool_parsers subpackage at all (those only exist on main). Caught while running the smoke test on Apple Silicon with the released mlx-lm 0.29.1: tokenizer.tool_parser raised AttributeError (falling through to the underlying HF tokenizer), so _tool_module_from_tokenizer always returned None and tool calls slipped through as raw <tool_call>...</tool_call> text in Reply.message instead of being parsed into ChatDelta.tool_calls. Fix: when has_tool_calling is True but tokenizer.tool_parser is missing, default the parse_tool_call callable to json.loads(body.strip()) — that's exactly what mlx_lm.tool_parsers.json_tools.parse_tool_call does on HEAD and covers the only format 0.29 detects (<tool_call>JSON</tool_call>). Future mlx-lm releases that ship more parsers will be picked up automatically via the tokenizer.tool_parser attribute when present. Also tighten the LoadModel logging — the old log line read init_kwargs.get('tool_parser_type') which doesn't exist on 0.29 and showed None even when has_tool_calling was True. Log the actual tool_call_start / tool_call_end markers instead. While here, switch Free()'s Metal cache clear from the deprecated mx.metal.clear_cache to mx.clear_cache (mlx >= 0.30), with a fallback for older releases. Mirrored to the mlx-vlm backend. * feat(mlx-distributed): mirror MLX parity (tool calls + ChatDelta + sampler) Same treatment as the mlx and mlx-vlm backends: emit Reply.chat_deltas with structured tool_calls / reasoning_content / token counts / logprobs, expand sampling parameter coverage beyond temp+top_p, and add the missing TokenizeString and Free RPCs. Notes specific to mlx-distributed: - Rank 0 is the only rank that owns a sampler — workers participate in the pipeline-parallel forward pass via mx.distributed and don't re-implement sampling. So the new logits_params (repetition_penalty, presence_penalty, frequency_penalty) and stop_words apply on rank 0 only; we don't need to extend coordinator.broadcast_generation_params, which still ships only max_tokens / temperature / top_p to workers (everything else is a rank-0 concern). - Free() now broadcasts CMD_SHUTDOWN to workers when a coordinator is active, so they release the model on their end too. The constant is already defined and handled by the existing worker loop in backend.py:633 (CMD_SHUTDOWN = -1). - Drop the locally-defined is_float / is_int / parse_options trio in favor of python_utils.parse_options, re-exported under the module name for back-compat with anything that imported it directly. - _prepare_prompt: route through messages_to_dicts so tool_call_id / tool_calls / reasoning_content survive, pass tools=json.loads( request.Tools) and enable_thinking=True to apply_chat_template, fall back on TypeError for templates that don't accept those kwargs. - New _tool_module_from_tokenizer (with the json.loads fallback for mlx-lm 0.29.x), _finalize_output, _truncate_at_stop helpers — same contract as the mlx backend. - LoadModel logs the auto-detected has_tool_calling / has_thinking / tool_call_start / tool_call_end so users can see what the wrapper picked up for the loaded model. - backend/python/mlx-distributed/test.py: add the same TestSharedHelpers unit tests (parse_options, messages_to_dicts, split_reasoning, parse_tool_calls) that exist for mlx and mlx-vlm.	2026-04-13 18:44:03 +02:00
Ettore Di Giacinto	daa0272f2e	docs(agents): capture vllm backend lessons + runtime lib packaging (#9333 ) New .agents/vllm-backend.md with everything that's easy to get wrong on the vllm/vllm-omni backends: - Use vLLM's native ToolParserManager / ReasoningParserManager — do not write regex-based parsers. Selection is explicit via Options[], defaults live in core/config/parser_defaults.json. - Concrete parsers don't always accept the tools= kwarg the abstract base declares; try/except TypeError is mandatory. - ChatDelta.tool_calls is the contract — Reply.message text alone won't surface tool calls in /v1/chat/completions. - vllm version pin trap: 0.14.1+cpu pairs with torch 2.9.1+cpu. Newer wheels declare torch==2.10.0+cpu which only exists on the PyTorch test channel and pulls an incompatible torchvision. - SIMD baseline: prebuilt wheel needs AVX-512 VNNI/BF16. SIGILL symptom + FROM_SOURCE=true escape hatch are documented. - libnuma.so.1 + libgomp.so.1 must be bundled because vllm._C silently fails to register torch ops if they're missing. - backend_hooks system: hooks_llamacpp / hooks_vllm split + the '*' / '' / named-backend keys. - ToProto() must serialize ToolCallID and Reasoning — easy to miss when adding fields to schema.Message. Also extended .agents/adding-backends.md with a generic 'Bundling runtime shared libraries' section: Dockerfile.python is FROM scratch, package.sh is the mechanism, libbackend.sh adds ${EDIR}/lib to LD_LIBRARY_PATH, and how to verify packaging without trusting the host (extract image, boot in fresh ubuntu container). Index in AGENTS.md updated.	2026-04-13 11:09:57 +02:00
Ettore Di Giacinto	d67623230f	feat(vllm): parity with llama.cpp backend (#9328 ) * fix(schema): serialize ToolCallID and Reasoning in Messages.ToProto The ToProto conversion was dropping tool_call_id and reasoning_content even though both proto and Go fields existed, breaking multi-turn tool calling and reasoning passthrough to backends. * refactor(config): introduce backend hook system and migrate llama-cpp defaults Adds RegisterBackendHook/runBackendHooks so each backend can register default-filling functions that run during ModelConfig.SetDefaults(). Migrates the existing GGUF guessing logic into hooks_llamacpp.go, registered for both 'llama-cpp' and the empty backend (auto-detect). Removes the old guesser.go shim. * feat(config): add vLLM parser defaults hook and importer auto-detection Introduces parser_defaults.json mapping model families to vLLM tool_parser/reasoning_parser names, with longest-pattern-first matching. The vllmDefaults hook auto-fills tool_parser and reasoning_parser options at load time for known families, while the VLLMImporter writes the same values into generated YAML so users can review and edit them. Adds tests covering MatchParserDefaults, hook registration via SetDefaults, and the user-override behavior. * feat(vllm): wire native tool/reasoning parsers + chat deltas + logprobs - Use vLLM's ToolParserManager/ReasoningParserManager to extract structured output (tool calls, reasoning content) instead of reimplementing parsing - Convert proto Messages to dicts and pass tools to apply_chat_template - Emit ChatDelta with content/reasoning_content/tool_calls in Reply - Extract prompt_tokens, completion_tokens, and logprobs from output - Replace boolean GuidedDecoding with proper GuidedDecodingParams from Grammar - Add TokenizeString and Free RPC methods - Fix missing `time` import used by load_video() * feat(vllm): CPU support + shared utils + vllm-omni feature parity - Split vllm install per acceleration: move generic `vllm` out of requirements-after.txt into per-profile after files (cublas12, hipblas, intel) and add CPU wheel URL for cpu-after.txt - requirements-cpu.txt now pulls torch==2.7.0+cpu from PyTorch CPU index - backend/index.yaml: register cpu-vllm / cpu-vllm-development variants - New backend/python/common/vllm_utils.py: shared parse_options, messages_to_dicts, setup_parsers helpers (used by both vllm backends) - vllm-omni: replace hardcoded chat template with tokenizer.apply_chat_template, wire native parsers via shared utils, emit ChatDelta with token counts, add TokenizeString and Free RPCs, detect CPU and set VLLM_TARGET_DEVICE - Add test_cpu_inference.py: standalone script to validate CPU build with a small model (Qwen2.5-0.5B-Instruct) * fix(vllm): CPU build compatibility with vllm 0.14.1 Validated end-to-end on CPU with Qwen2.5-0.5B-Instruct (LoadModel, Predict, TokenizeString, Free all working). - requirements-cpu-after.txt: pin vllm to 0.14.1+cpu (pre-built wheel from GitHub releases) for x86_64 and aarch64. vllm 0.14.1 is the newest CPU wheel whose torch dependency resolves against published PyTorch builds (torch==2.9.1+cpu). Later vllm CPU wheels currently require torch==2.10.0+cpu which is only available on the PyTorch test channel with incompatible torchvision. - requirements-cpu.txt: bump torch to 2.9.1+cpu, add torchvision/torchaudio so uv resolves them consistently from the PyTorch CPU index. - install.sh: add --index-strategy=unsafe-best-match for CPU builds so uv can mix the PyTorch index and PyPI for transitive deps (matches the existing intel profile behaviour). - backend.py LoadModel: vllm >= 0.14 removed AsyncLLMEngine.get_model_config so the old code path errored out with AttributeError on model load. Switch to the new get_tokenizer()/tokenizer accessor with a fallback to building the tokenizer directly from request.Model. * fix(vllm): tool parser constructor compat + e2e tool calling test Concrete vLLM tool parsers override the abstract base's __init__ and drop the tools kwarg (e.g. Hermes2ProToolParser only takes tokenizer). Instantiating with tools= raised TypeError which was silently caught, leaving chat_deltas.tool_calls empty. Retry the constructor without the tools kwarg on TypeError — tools aren't required by these parsers since extract_tool_calls finds tool syntax in the raw model output directly. Validated with Qwen/Qwen2.5-0.5B-Instruct + hermes parser on CPU: the backend correctly returns ToolCallDelta{name='get_weather', arguments='{"location": "Paris, France"}'} in ChatDelta. test_tool_calls.py is a standalone smoke test that spawns the gRPC backend, sends a chat completion with tools, and asserts the response contains a structured tool call. * ci(backend): build cpu-vllm container image Add the cpu-vllm variant to the backend container build matrix so the image registered in backend/index.yaml (cpu-vllm / cpu-vllm-development) is actually produced by CI. Follows the same pattern as the other CPU python backends (cpu-diffusers, cpu-chatterbox, etc.) with build-type='' and no CUDA. backend_pr.yml auto-picks this up via its matrix filter from backend.yml. * test(e2e-backends): add tools capability + HF model name support Extends tests/e2e-backends to cover backends that: - Resolve HuggingFace model ids natively (vllm, vllm-omni) instead of loading a local file: BACKEND_TEST_MODEL_NAME is passed verbatim as ModelOptions.Model with no download/ModelFile. - Parse tool calls into ChatDelta.tool_calls: new "tools" capability sends a Predict with a get_weather function definition and asserts the Reply contains a matching ToolCallDelta. Uses UseTokenizerTemplate with OpenAI-style Messages so the backend can wire tools into the model's chat template. - Need backend-specific Options[]: BACKEND_TEST_OPTIONS lets a test set e.g. "tool_parser:hermes,reasoning_parser:qwen3" at LoadModel time. Adds make target test-extra-backend-vllm that: - docker-build-vllm - loads Qwen/Qwen2.5-0.5B-Instruct - runs health,load,predict,stream,tools with tool_parser:hermes Drops backend/python/vllm/test_{cpu_inference,tool_calls}.py — those standalone scripts were scaffolding used while bringing up the Python backend; the e2e-backends harness now covers the same ground uniformly alongside llama-cpp and ik-llama-cpp. * ci(test-extra): run vllm e2e tests on CPU Adds tests-vllm-grpc to the test-extra workflow, mirroring the llama-cpp and ik-llama-cpp gRPC jobs. Triggers when files under backend/python/vllm/ change (or on run-all), builds the local-ai vllm container image, and runs the tests/e2e-backends harness with BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct, tool_parser:hermes, and the tools capability enabled. Uses ubuntu-latest (no GPU) — vllm runs on CPU via the cpu-vllm wheel we pinned in requirements-cpu-after.txt. Frees disk space before the build since the docker image + torch + vllm wheel is sizeable. * fix(vllm): build from source on CI to avoid SIGILL on prebuilt wheel The prebuilt vllm 0.14.1+cpu wheel from GitHub releases is compiled with SIMD instructions (AVX-512 VNNI/BF16 or AMX-BF16) that not every CPU supports. GitHub Actions ubuntu-latest runners SIGILL when vllm spawns the model_executor.models.registry subprocess for introspection, so LoadModel never reaches the actual inference path. - install.sh: when FROM_SOURCE=true on a CPU build, temporarily hide requirements-cpu-after.txt so installRequirements installs the base deps + torch CPU without pulling the prebuilt wheel, then clone vllm and compile it with VLLM_TARGET_DEVICE=cpu. The resulting binaries target the host's actual CPU. - backend/Dockerfile.python: accept a FROM_SOURCE build-arg and expose it as an ENV so install.sh sees it during `make`. - Makefile docker-build-backend: forward FROM_SOURCE as --build-arg when set, so backends that need source builds can opt in. - Makefile test-extra-backend-vllm: call docker-build-vllm via a recursive $(MAKE) invocation so FROM_SOURCE flows through. - .github/workflows/test-extra.yml: set FROM_SOURCE=true on the tests-vllm-grpc job. Slower but reliable — the prebuilt wheel only works on hosts that share the build-time SIMD baseline. Answers 'did you test locally?': yes, end-to-end on my local machine with the prebuilt wheel (CPU supports AVX-512 VNNI). The CI runner CPU gap was not covered locally — this commit plugs that gap. * ci(vllm): use bigger-runner instead of source build The prebuilt vllm 0.14.1+cpu wheel requires SIMD instructions (AVX-512 VNNI/BF16) that stock ubuntu-latest GitHub runners don't support — vllm.model_executor.models.registry SIGILLs on import during LoadModel. Source compilation works but takes 30-40 minutes per CI run, which is too slow for an e2e smoke test. Instead, switch tests-vllm-grpc to the bigger-runner self-hosted label (already used by backend.yml for the llama-cpp CUDA build) — that hardware has the required SIMD baseline and the prebuilt wheel runs cleanly. FROM_SOURCE=true is kept as an opt-in escape hatch: - install.sh still has the CPU source-build path for hosts that need it - backend/Dockerfile.python still declares the ARG + ENV - Makefile docker-build-backend still forwards the build-arg when set Default CI path uses the fast prebuilt wheel; source build can be re-enabled by exporting FROM_SOURCE=true in the environment. * ci(vllm): install make + build deps on bigger-runner bigger-runner is a bare self-hosted runner used by backend.yml for docker image builds — it has docker but not the usual ubuntu-latest toolchain. The make-based test target needs make, build-essential (cgo in 'go test'), and curl/unzip (the Makefile protoc target downloads protoc from github releases). protoc-gen-go and protoc-gen-go-grpc come via 'go install' in the install-go-tools target, which setup-go makes possible. * ci(vllm): install libnuma1 + libgomp1 on bigger-runner The vllm 0.14.1+cpu wheel ships a _C C++ extension that dlopens libnuma.so.1 at import time. When the runner host doesn't have it, the extension silently fails to register its torch ops, so EngineCore crashes on init_device with: AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' Also add libgomp1 (OpenMP runtime, used by torch CPU kernels) to be safe on stripped-down runners. * feat(vllm): bundle libnuma/libgomp via package.sh The vllm CPU wheel ships a _C extension that dlopens libnuma.so.1 at import time; torch's CPU kernels in turn use libgomp.so.1 (OpenMP). Without these on the host, vllm._C silently fails to register its torch ops and EngineCore crashes with: AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' Rather than asking every user to install libnuma1/libgomp1 on their host (or every LocalAI base image to ship them), bundle them into the backend image itself — same pattern fish-speech and the GPU libs already use. libbackend.sh adds ${EDIR}/lib to LD_LIBRARY_PATH at run time so the bundled copies are picked up automatically. - backend/python/vllm/package.sh (new): copies libnuma.so.1 and libgomp.so.1 from the builder's multilib paths into ${BACKEND}/lib, preserving soname symlinks. Runs during Dockerfile.python's 'Run backend-specific packaging' step (which already invokes package.sh if present). - backend/Dockerfile.python: install libnuma1 + libgomp1 in the builder stage so package.sh has something to copy (the Ubuntu base image otherwise only has libgomp in the gcc dep chain). - test-extra.yml: drop the workaround that installed these libs on the runner host — with the backend image self-contained, the runner no longer needs them, and the test now exercises the packaging path end-to-end the way a production host would. * ci(vllm): disable tests-vllm-grpc job (heterogeneous runners) Both ubuntu-latest and bigger-runner have inconsistent CPU baselines: some instances support the AVX-512 VNNI/BF16 instructions the prebuilt vllm 0.14.1+cpu wheel was compiled with, others SIGILL on import of vllm.model_executor.models.registry. The libnuma packaging fix doesn't help when the wheel itself can't be loaded. FROM_SOURCE=true compiles vllm against the actual host CPU and works everywhere, but takes 30-50 minutes per run — too slow for a smoke test on every PR. Comment out the job for now. The test itself is intact and passes locally; run it via 'make test-extra-backend-vllm' on a host with the required SIMD baseline. Re-enable when: - we have a self-hosted runner label with guaranteed AVX-512 VNNI/BF16, or - vllm publishes a CPU wheel with a wider baseline, or - we set up a docker layer cache that makes FROM_SOURCE acceptable The detect-changes vllm output, the test harness changes (tests/ e2e-backends + tools cap), the make target (test-extra-backend-vllm), the package.sh and the Dockerfile/install.sh plumbing all stay in place.	2026-04-13 11:00:29 +02:00
LocalAI [bot]	0f90d17aac	feat(swagger): update swagger (#9329 ) Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-13 09:42:36 +02:00
LocalAI [bot]	ea32b8953f	chore: ⬆️ Update ggml-org/llama.cpp to `1e9d771e2c2f1113a5ebdd0dc15bafe57dce64be` (#9330 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-13 09:42:18 +02:00
Ettore Di Giacinto	bc7578bdb1	fix(hipblas): pin down rocm6.4 wheels on whisperx (7.x not supported) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-12 15:27:51 +00:00
Ettore Di Giacinto	9ca03cf9cc	feat(backends): add ik-llama-cpp (#9326 ) * feat(backends): add ik-llama-cpp Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * chore: add grpc e2e suite, hook to CI, update README Signed-off-by: Ettore Di Giacinto <mudler@localai.io> * Apply suggestion from @mudler Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> * Apply suggestion from @mudler Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com> --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>	2026-04-12 13:51:28 +02:00
Ettore Di Giacinto	151ad271f2	feat(rocm): bump to 7.x (#9323 ) feat(rocm): bump to 7.2.1 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-12 08:51:30 +02:00
Ettore Di Giacinto	2865f0f8d3	feat(ux): backend management enhancement (#9325 ) * feat: add PreferDevelopmentBackends setting, expose isMeta/isDevelopment in API - Add PreferDevelopmentBackends config field, CLI flag, runtime setting - Add IsDevelopment() method to GalleryBackend - Use AvailableBackendsUnfiltered in UI API to show all backends - Expose isMeta, isDevelopment, preferDevelopmentBackends in backend API response * feat: upgrade banner with Upgrade All button, detect pre-existing backends - Add upgrade banner on Backends page showing count and Upgrade All button - Fix upgrade detection for backends installed before version tracking: flag as upgradeable when gallery has a version but installed has none - Fix OCI digest check to flag backends with no stored digest as upgradeable	2026-04-12 00:35:22 +02:00
LocalAI [bot]	6fbda277c5	chore: ⬆️ Update ggml-org/llama.cpp to `ff5ef8278615a2462b79b50abdf3cc95cfb31c6f` (#9319 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-11 23:15:23 +02:00
Ettore Di Giacinto	7a0e6ae6d2	feat(qwen3tts.cpp): add new backend (#9316 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-11 23:14:26 +02:00
LocalAI [bot]	e4bfc42a2d	chore: ⬆️ Update leejet/stable-diffusion.cpp to `6b675a5ede9b0edf0a0f44191e8b79d7ef27615a` (#9320 ) ⬆️ Update leejet/stable-diffusion.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-11 23:07:30 +02:00
LocalAI [bot]	7edd3ea96f	chore(model-gallery): ⬆️ update checksum (#9321 ) ⬆️ Checksum updates in gallery/index.yaml Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-11 22:53:48 +02:00
LocalAI [bot]	b20a2f1cea	feat(swagger): update swagger (#9318 ) Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-11 22:31:36 +02:00
Ettore Di Giacinto	8ab0744458	feat: backend versioning, upgrade detection and auto-upgrade (#9315 ) * feat: add backend versioning data model foundation Add Version, URI, and Digest fields to BackendMetadata for tracking installed backend versions and enabling upgrade detection. Add Version field to GalleryBackend. Add UpgradeAvailable/AvailableVersion fields to SystemBackend. Implement GetImageDigest() for lightweight OCI digest lookups via remote.Head. Record version, URI, and digest at install time in InstallBackend() and propagate version through meta backends. * feat: add backend upgrade detection and execution logic Add CheckBackendUpgrades() to compare installed backend versions/digests against gallery entries, and UpgradeBackend() to perform atomic upgrades with backup-based rollback on failure. Includes Agent A's data model changes (Version/URI/Digest fields, GetImageDigest). * feat: add AutoUpgradeBackends config and runtime settings Add configuration and runtime settings for backend auto-upgrade: - RuntimeSettings field for dynamic config via API/JSON - ApplicationConfig field, option func, and roundtrip conversion - CLI flag with LOCALAI_AUTO_UPGRADE_BACKENDS env var - Config file watcher support for runtime_settings.json - Tests for ToRuntimeSettings, ApplyRuntimeSettings, and roundtrip * feat(ui): add backend version display and upgrade support - Add upgrade check/trigger API endpoints to config and api module - Backends page: version badge, upgrade indicator, upgrade button - Manage page: version in metadata, context-aware upgrade/reinstall button - Settings page: auto-upgrade backends toggle * feat: add upgrade checker service, API endpoints, and CLI command - UpgradeChecker background service: checks every 6h, auto-upgrades when enabled - API endpoints: GET /backends/upgrades, POST /backends/upgrades/check, POST /backends/upgrade/:name - CLI: `localai backends upgrade` command, version display in `backends list` - BackendManager interface: add UpgradeBackend and CheckUpgrades methods - Wire upgrade op through GalleryService backend handler - Distributed mode: fan-out upgrade to worker nodes via NATS * fix: use advisory lock for upgrade checker in distributed mode In distributed mode with multiple frontend instances, use PostgreSQL advisory lock (KeyBackendUpgradeCheck) so only one instance runs periodic upgrade checks and auto-upgrades. Prevents duplicate upgrade operations across replicas. Standalone mode is unchanged (simple ticker loop). * test: add e2e tests for backend upgrade API - Test GET /api/backends/upgrades returns 200 (even with no upgrade checker) - Test POST /api/backends/upgrade/:name accepts request and returns job ID - Test full upgrade flow: trigger upgrade via API, wait for job completion, verify run.sh updated to v2 and metadata.json has version 2.0.0 - Test POST /api/backends/upgrades/check returns 200 - Fix nil check for applicationInstance in upgrade API routes	2026-04-11 22:31:15 +02:00
thelittlefireman	7c1865b307	Fix load of z-image-turbo (#9264 ) * Fix load of z-image and improve speed Signed-off-by: thelittlefireman <5165783+thelittlefireman@users.noreply.github.com> * Remove diffusion_flash_attn from z-image-ggml.yaml Removed 'diffusion_flash_attn' parameter from configuration. Signed-off-by: thelittlefireman <5165783+thelittlefireman@users.noreply.github.com> --------- Signed-off-by: thelittlefireman <5165783+thelittlefireman@users.noreply.github.com>	2026-04-11 08:42:13 +02:00
LocalAI [bot]	62a674ce12	chore: ⬆️ Update ggml-org/llama.cpp to `e62fa13c2497b2cd1958cb496e9489e86bbd5182` (#9312 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-11 08:39:10 +02:00
LocalAI [bot]	c39213443b	feat(swagger): update swagger (#9310 ) Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-11 08:38:55 +02:00
LocalAI [bot]	606f462da4	chore: ⬆️ Update PABannier/sam3.cpp to `01832ef85fcc8eb6488f1d01cd247f07e96ff5a9` (#9311 ) ⬆️ Update PABannier/sam3.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-04-11 08:38:30 +02:00
Ettore Di Giacinto	5c35e85fe2	feat: allow to pin models and skip from reaping (#9309 ) Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-11 08:38:17 +02:00
Leigh Phillips	062e0d0d00	feat: Add toggle mechanism to enable/disable models from loading on demand (#9304 ) * feat: add toggle mechanism to enable/disable models from loading on demand Implements #9303 - Adds ability to disable models from being auto-loaded while keeping them in the collection. Backend changes: - Add Disabled field to ModelConfig struct with IsDisabled() getter - New ToggleModelEndpoint handler (PUT /models/toggle/:name/:action) - Request middleware returns 403 when disabled model is requested - Capabilities endpoint exposes disabled status Frontend changes: - Toggle switch in System > Models table Actions column - Visual indicators: dimmed row, red Disabled badge, muted icons - Tooltip describes toggle function on hover - Loading state while API call is in progress * fix: remove extra closing brace causing syntax error in request middleware * refactor: reorder Actions column - Stop button before toggle switch * refactor: migrate from toggle to toggle-state per PR review feedback	2026-04-10 18:17:41 +02:00

1 2 3 4 5 ...

6033 Commits