Commit Graph

665 Commits

Author SHA1 Message Date
dependabot[bot]
efae3fd97b chore(deps): bump dompurify
Bumps the npm_and_yarn group with 1 update in the /core/http/react-ui directory: [dompurify](https://github.com/cure53/DOMPurify).


Updates `dompurify` from 3.3.2 to 3.4.0
- [Release notes](https://github.com/cure53/DOMPurify/releases)
- [Commits](https://github.com/cure53/DOMPurify/compare/3.3.2...3.4.0)

---
updated-dependencies:
- dependency-name: dompurify
  dependency-version: 3.4.0
  dependency-type: direct:production
  dependency-group: npm_and_yarn
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-04-16 06:24:27 +00:00
dependabot[bot]
ab326a9c61 chore(deps): bump the npm_and_yarn group across 1 directory with 6 updates (#9373)
Bumps the npm_and_yarn group with 6 updates in the /core/http/react-ui directory:

| Package | From | To |
| --- | --- | --- |
| [vite](https://github.com/vitejs/vite/tree/HEAD/packages/vite) | `6.4.1` | `6.4.2` |
| [@hono/node-server](https://github.com/honojs/node-server) | `1.19.11` | `1.19.14` |
| [flatted](https://github.com/WebReflection/flatted) | `3.3.4` | `3.4.2` |
| [hono](https://github.com/honojs/hono) | `4.12.7` | `4.12.14` |
| [path-to-regexp](https://github.com/pillarjs/path-to-regexp) | `8.3.0` | `8.4.2` |
| [picomatch](https://github.com/micromatch/picomatch) | `4.0.3` | `4.0.4` |



Updates `vite` from 6.4.1 to 6.4.2
- [Release notes](https://github.com/vitejs/vite/releases)
- [Changelog](https://github.com/vitejs/vite/blob/v6.4.2/packages/vite/CHANGELOG.md)
- [Commits](https://github.com/vitejs/vite/commits/v6.4.2/packages/vite)

Updates `@hono/node-server` from 1.19.11 to 1.19.14
- [Release notes](https://github.com/honojs/node-server/releases)
- [Commits](https://github.com/honojs/node-server/compare/v1.19.11...v1.19.14)

Updates `flatted` from 3.3.4 to 3.4.2
- [Commits](https://github.com/WebReflection/flatted/compare/v3.3.4...v3.4.2)

Updates `hono` from 4.12.7 to 4.12.14
- [Release notes](https://github.com/honojs/hono/releases)
- [Commits](https://github.com/honojs/hono/compare/v4.12.7...v4.12.14)

Updates `path-to-regexp` from 8.3.0 to 8.4.2
- [Release notes](https://github.com/pillarjs/path-to-regexp/releases)
- [Changelog](https://github.com/pillarjs/path-to-regexp/blob/master/History.md)
- [Commits](https://github.com/pillarjs/path-to-regexp/compare/v8.3.0...v8.4.2)

Updates `picomatch` from 4.0.3 to 4.0.4
- [Release notes](https://github.com/micromatch/picomatch/releases)
- [Changelog](https://github.com/micromatch/picomatch/blob/master/CHANGELOG.md)
- [Commits](https://github.com/micromatch/picomatch/compare/4.0.3...4.0.4)

---
updated-dependencies:
- dependency-name: vite
  dependency-version: 6.4.2
  dependency-type: direct:development
  dependency-group: npm_and_yarn
- dependency-name: "@hono/node-server"
  dependency-version: 1.19.14
  dependency-type: indirect
  dependency-group: npm_and_yarn
- dependency-name: flatted
  dependency-version: 3.4.2
  dependency-type: indirect
  dependency-group: npm_and_yarn
- dependency-name: hono
  dependency-version: 4.12.14
  dependency-type: indirect
  dependency-group: npm_and_yarn
- dependency-name: path-to-regexp
  dependency-version: 8.4.2
  dependency-type: indirect
  dependency-group: npm_and_yarn
- dependency-name: picomatch
  dependency-version: 4.0.4
  dependency-type: indirect
  dependency-group: npm_and_yarn
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-16 08:23:03 +02:00
Ettore Di Giacinto
ad3c8c4832 fix(agents): handle embedding model dim changes on collection upload (#9365)
Bumps LocalAGI to pick up the LocalRecall postgres backend fix that
resizes the pgvector column when the configured embedding model
returns vectors of a different dimensionality than the existing
collection. Switching the agent pool's embedding model now triggers
a transparent re-embed at startup instead of failing every subsequent
upload with 'expected N dimensions, not M' (SQLSTATE 22000).

Also surfaces a 409 with an actionable message in
UploadToCollectionEndpoint as a safety net for the rare cases the
upstream migration path doesn't cover (e.g. a model swapped at
runtime), instead of the previous opaque 500.
2026-04-15 20:05:28 +02:00
Ettore Di Giacinto
410d100cc3 chore(ui): improve visibility of forms, color palette
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-14 21:53:03 +00:00
Ettore Di Giacinto
87e6de1989 feat: wire transcription for llama.cpp, add streaming support (#9353)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-14 16:13:40 +02:00
Ettore Di Giacinto
016da02845 feat: refactor shared helpers and enhance MLX backend functionality (#9335)
* refactor(backends): extract python_utils + add mlx_utils shared helpers

Move parse_options() and messages_to_dicts() out of vllm_utils.py into a
new framework-agnostic python_utils.py, and re-export them from vllm_utils
so existing vllm / vllm-omni imports keep working.

Add mlx_utils.py with split_reasoning() and parse_tool_calls() — ported
from mlx_vlm/server.py's process_tool_calls. These work with any
mlx-lm / mlx-vlm tool module (anything exposing tool_call_start,
tool_call_end, parse_tool_call). Used by the mlx and mlx-vlm backends in
later commits to emit structured ChatDelta.tool_calls without
reimplementing per-model parsing.

Shared smoke tests confirm:
- parse_options round-trips bool/int/float/string
- vllm_utils re-exports are identity-equal to python_utils originals
- mlx_utils parse_tool_calls handles <tool_call>...</tool_call> with a
  shim module and produces a correctly-indexed list with JSON arguments
- mlx_utils split_reasoning extracts <think> blocks and leaves clean
  content

* feat(mlx): wire native tool parsers + ChatDelta + token usage + logprobs

Bring the MLX backend up to the same structured-output contract as vLLM
and llama.cpp: emit Reply.chat_deltas so the OpenAI HTTP layer sees
tool_calls and reasoning_content, not just raw text.

Key insight: mlx_lm.load() returns a TokenizerWrapper that already auto-
detects the right tool parser from the model's chat template
(_infer_tool_parser in mlx_lm/tokenizer_utils.py). The wrapper exposes
has_tool_calling, has_thinking, tool_parser, tool_call_start,
tool_call_end, think_start, think_end — no user configuration needed,
unlike vLLM.

Changes in backend/python/mlx/backend.py:

- Imports: replace inline parse_options / messages_to_dicts with the
  shared helpers from python_utils. Pull split_reasoning / parse_tool_calls
  from the new mlx_utils shared module.
- LoadModel: log the auto-detected has_tool_calling / has_thinking /
  tool_parser_type for observability. Drop the local is_float / is_int
  duplicates.
- _prepare_prompt: run request.Messages through messages_to_dicts so
  tool_call_id / tool_calls / reasoning_content survive the conversion,
  and pass tools=json.loads(request.Tools) + enable_thinking=True (when
  request.Metadata says so) to apply_chat_template. Falls back on
  TypeError for tokenizers whose template doesn't accept those kwargs.
- _build_generation_params: return an additional (logits_params,
  stop_words) pair. Maps RepetitionPenalty / PresencePenalty /
  FrequencyPenalty to mlx_lm.sample_utils.make_logits_processors and
  threads StopPrompts through to post-decode truncation.
- New _tool_module_from_tokenizer / _finalize_output / _truncate_at_stop
  helpers. _finalize_output runs split_reasoning when has_thinking is
  true and parse_tool_calls (using a SimpleNamespace shim around the
  wrapper's tool_parser callable) when has_tool_calling is true, then
  extracts prompt_tokens, generation_tokens and (best-effort) logprobs
  from the last GenerationResponse chunk.
- Predict: use make_logits_processors, accumulate text + last_response,
  finalize into a structured Reply carrying chat_deltas,
  prompt_tokens, tokens, logprobs. Early-stops on user stop sequences.
- PredictStream: per-chunk Reply still carries raw message bytes for
  back-compat but now also emits chat_deltas=[ChatDelta(content=delta)].
  On loop exit, emit a terminal Reply with structured
  reasoning_content / tool_calls / token counts / logprobs — so the Go
  side sees tool calls without needing the regex fallback.
- TokenizeString RPC: uses the TokenizerWrapper's encode(); returns
  length + tokens or FAILED_PRECONDITION if the model isn't loaded.
- Free RPC: drops model / tokenizer / lru_cache, runs gc.collect(),
  calls mx.metal.clear_cache() when available, and best-effort clears
  torch.cuda as a belt-and-suspenders.

* feat(mlx-vlm): mirror MLX parity (tool parsers + ChatDelta + samplers)

Same treatment as the MLX backend: emit structured Reply.chat_deltas,
tool_calls, reasoning_content, token counts and logprobs, and extend
sampling parameter coverage beyond the temp/top_p pair the backend
used to handle.

- Imports: drop the inline is_float/is_int helpers, pull parse_options /
  messages_to_dicts from python_utils and split_reasoning /
  parse_tool_calls from mlx_utils. Also import make_sampler and
  make_logits_processors from mlx_lm.sample_utils — mlx-vlm re-uses them.
- LoadModel: use parse_options; call mlx_vlm.tool_parsers._infer_tool_parser
  / load_tool_module to auto-detect a tool module from the processor's
  chat_template. Stash think_start / think_end / has_thinking so later
  finalisation can split reasoning blocks without duck-typing on each
  call. Logs the detected parser type.
- _prepare_prompt: convert proto Messages via messages_to_dicts (so
  tool_call_id / tool_calls survive), pass tools=json.loads(request.Tools)
  and enable_thinking=True to apply_chat_template when present, fall
  back on TypeError for older mlx-vlm versions. Also handle the
  prompt-only + media and empty-prompt + media paths consistently.
- _build_generation_params: return (max_tokens, sampler_params,
  logits_params, stop_words). Maps repetition_penalty / presence_penalty /
  frequency_penalty and passes them through make_logits_processors.
- _finalize_output / _truncate_at_stop: common helper used by Predict
  and PredictStream to split reasoning, run parse_tool_calls against the
  auto-detected tool module, build ToolCallDelta list, and extract token
  counts + logprobs from the last GenerationResult.
- Predict / PredictStream: switch from mlx_vlm.generate to mlx_vlm.stream_generate
  in both paths, accumulate text + last_response, pass sampler and
  logits_processors through, emit content-only ChatDelta per streaming
  chunk followed by a terminal Reply carrying reasoning_content,
  tool_calls, prompt_tokens, tokens and logprobs. Non-streaming Predict
  returns the same structured Reply shape.
- New helper _collect_media extracted from the duplicated base64 image /
  audio decode loop.
- New TokenizeString RPC using the processor's tokenizer.encode and
  Free RPC that drops model/processor/config, runs gc + Metal cache
  clear + best-effort torch.cuda cache clear.

* feat(importer/mlx): auto-set tool_parser/reasoning_parser on import

Mirror what core/gallery/importers/vllm.go does: after applying the
shared inference defaults, look up the model URI in parser_defaults.json
and append matching tool_parser:/reasoning_parser: entries to Options.

The MLX backends auto-detect tool parsers from the chat template at
runtime so they don't actually consume these options — but surfacing
them in the generated YAML:
  - keeps the import experience consistent with vllm
  - gives users a single visible place to override
  - documents the intended parser for a given model family

* test(mlx): add helper unit tests + TokenizeString/Free + e2e make targets

- backend/python/mlx/test.py: add TestSharedHelpers with server-less
  unit tests for parse_options, messages_to_dicts, split_reasoning and
  parse_tool_calls (using a SimpleNamespace shim to fake a tool module
  without requiring a model). Plus test_tokenize_string and test_free
  RPC tests that load a tiny MLX-quantized Llama and exercise the new
  RPCs end-to-end.

- backend/python/mlx-vlm/test.py: same helper unit tests + cleanup of
  the duplicated import block at the top of the file.

- Makefile: register BACKEND_MLX and BACKEND_MLX_VLM (they were missing
  from the docker-build-target eval list — only mlx-distributed had a
  generated target before). Add test-extra-backend-mlx and
  test-extra-backend-mlx-vlm convenience targets that build the
  respective image and run tests/e2e-backends with the tools capability
  against mlx-community/Qwen2.5-0.5B-Instruct-4bit. The MLX backend
  auto-detects the tool parser from the chat template so no
  BACKEND_TEST_OPTIONS is needed (unlike vllm).

* fix(libbackend): don't pass --copies to venv unless PORTABLE_PYTHON=true

backend/python/common/libbackend.sh:ensureVenv() always invoked
'python -m venv --copies', but macOS system python (and some other
builds) refuses with:

    Error: This build of python cannot create venvs without using symlinks

--copies only matters when _makeVenvPortable later relocates the venv,
which only happens when PORTABLE_PYTHON=true. Make --copies conditional
on that flag and fall back to default (symlinked) venv otherwise.

Caught while bringing up the mlx backend on Apple Silicon — the same
build path is used by every Python backend with USE_PIP=true.

* fix(mlx): support mlx-lm 0.29.x tool calling + drop deprecated clear_cache

The released mlx-lm 0.29.x ships a much simpler tool-calling API than
HEAD: TokenizerWrapper detects the <tool_call>...</tool_call> markers
from the tokenizer vocab and exposes has_tool_calling /
tool_call_start / tool_call_end, but does NOT expose a tool_parser
callable on the wrapper and does NOT ship a mlx_lm.tool_parsers
subpackage at all (those only exist on main).

Caught while running the smoke test on Apple Silicon with the
released mlx-lm 0.29.1: tokenizer.tool_parser raised AttributeError
(falling through to the underlying HF tokenizer), so
_tool_module_from_tokenizer always returned None and tool calls slipped
through as raw <tool_call>...</tool_call> text in Reply.message instead
of being parsed into ChatDelta.tool_calls.

Fix: when has_tool_calling is True but tokenizer.tool_parser is missing,
default the parse_tool_call callable to json.loads(body.strip()) — that's
exactly what mlx_lm.tool_parsers.json_tools.parse_tool_call does on HEAD
and covers the only format 0.29 detects (<tool_call>JSON</tool_call>).
Future mlx-lm releases that ship more parsers will be picked up
automatically via the tokenizer.tool_parser attribute when present.

Also tighten the LoadModel logging — the old log line read
init_kwargs.get('tool_parser_type') which doesn't exist on 0.29 and
showed None even when has_tool_calling was True. Log the actual
tool_call_start / tool_call_end markers instead.

While here, switch Free()'s Metal cache clear from the deprecated
mx.metal.clear_cache to mx.clear_cache (mlx >= 0.30), with a
fallback for older releases. Mirrored to the mlx-vlm backend.

* feat(mlx-distributed): mirror MLX parity (tool calls + ChatDelta + sampler)

Same treatment as the mlx and mlx-vlm backends: emit Reply.chat_deltas
with structured tool_calls / reasoning_content / token counts /
logprobs, expand sampling parameter coverage beyond temp+top_p, and
add the missing TokenizeString and Free RPCs.

Notes specific to mlx-distributed:

- Rank 0 is the only rank that owns a sampler — workers participate in
  the pipeline-parallel forward pass via mx.distributed and don't
  re-implement sampling. So the new logits_params (repetition_penalty,
  presence_penalty, frequency_penalty) and stop_words apply on rank 0
  only; we don't need to extend coordinator.broadcast_generation_params,
  which still ships only max_tokens / temperature / top_p to workers
  (everything else is a rank-0 concern).
- Free() now broadcasts CMD_SHUTDOWN to workers when a coordinator is
  active, so they release the model on their end too. The constant is
  already defined and handled by the existing worker loop in
  backend.py:633 (CMD_SHUTDOWN = -1).
- Drop the locally-defined is_float / is_int / parse_options trio in
  favor of python_utils.parse_options, re-exported under the module
  name for back-compat with anything that imported it directly.
- _prepare_prompt: route through messages_to_dicts so tool_call_id /
  tool_calls / reasoning_content survive, pass tools=json.loads(
  request.Tools) and enable_thinking=True to apply_chat_template, fall
  back on TypeError for templates that don't accept those kwargs.
- New _tool_module_from_tokenizer (with the json.loads fallback for
  mlx-lm 0.29.x), _finalize_output, _truncate_at_stop helpers — same
  contract as the mlx backend.
- LoadModel logs the auto-detected has_tool_calling / has_thinking /
  tool_call_start / tool_call_end so users can see what the wrapper
  picked up for the loaded model.
- backend/python/mlx-distributed/test.py: add the same TestSharedHelpers
  unit tests (parse_options, messages_to_dicts, split_reasoning,
  parse_tool_calls) that exist for mlx and mlx-vlm.
2026-04-13 18:44:03 +02:00
Ettore Di Giacinto
d67623230f feat(vllm): parity with llama.cpp backend (#9328)
* fix(schema): serialize ToolCallID and Reasoning in Messages.ToProto

The ToProto conversion was dropping tool_call_id and reasoning_content
even though both proto and Go fields existed, breaking multi-turn tool
calling and reasoning passthrough to backends.

* refactor(config): introduce backend hook system and migrate llama-cpp defaults

Adds RegisterBackendHook/runBackendHooks so each backend can register
default-filling functions that run during ModelConfig.SetDefaults().

Migrates the existing GGUF guessing logic into hooks_llamacpp.go,
registered for both 'llama-cpp' and the empty backend (auto-detect).
Removes the old guesser.go shim.

* feat(config): add vLLM parser defaults hook and importer auto-detection

Introduces parser_defaults.json mapping model families to vLLM
tool_parser/reasoning_parser names, with longest-pattern-first matching.

The vllmDefaults hook auto-fills tool_parser and reasoning_parser
options at load time for known families, while the VLLMImporter writes
the same values into generated YAML so users can review and edit them.

Adds tests covering MatchParserDefaults, hook registration via
SetDefaults, and the user-override behavior.

* feat(vllm): wire native tool/reasoning parsers + chat deltas + logprobs

- Use vLLM's ToolParserManager/ReasoningParserManager to extract structured
  output (tool calls, reasoning content) instead of reimplementing parsing
- Convert proto Messages to dicts and pass tools to apply_chat_template
- Emit ChatDelta with content/reasoning_content/tool_calls in Reply
- Extract prompt_tokens, completion_tokens, and logprobs from output
- Replace boolean GuidedDecoding with proper GuidedDecodingParams from Grammar
- Add TokenizeString and Free RPC methods
- Fix missing `time` import used by load_video()

* feat(vllm): CPU support + shared utils + vllm-omni feature parity

- Split vllm install per acceleration: move generic `vllm` out of
  requirements-after.txt into per-profile after files (cublas12, hipblas,
  intel) and add CPU wheel URL for cpu-after.txt
- requirements-cpu.txt now pulls torch==2.7.0+cpu from PyTorch CPU index
- backend/index.yaml: register cpu-vllm / cpu-vllm-development variants
- New backend/python/common/vllm_utils.py: shared parse_options,
  messages_to_dicts, setup_parsers helpers (used by both vllm backends)
- vllm-omni: replace hardcoded chat template with tokenizer.apply_chat_template,
  wire native parsers via shared utils, emit ChatDelta with token counts,
  add TokenizeString and Free RPCs, detect CPU and set VLLM_TARGET_DEVICE
- Add test_cpu_inference.py: standalone script to validate CPU build with
  a small model (Qwen2.5-0.5B-Instruct)

* fix(vllm): CPU build compatibility with vllm 0.14.1

Validated end-to-end on CPU with Qwen2.5-0.5B-Instruct (LoadModel, Predict,
TokenizeString, Free all working).

- requirements-cpu-after.txt: pin vllm to 0.14.1+cpu (pre-built wheel from
  GitHub releases) for x86_64 and aarch64. vllm 0.14.1 is the newest CPU
  wheel whose torch dependency resolves against published PyTorch builds
  (torch==2.9.1+cpu). Later vllm CPU wheels currently require
  torch==2.10.0+cpu which is only available on the PyTorch test channel
  with incompatible torchvision.
- requirements-cpu.txt: bump torch to 2.9.1+cpu, add torchvision/torchaudio
  so uv resolves them consistently from the PyTorch CPU index.
- install.sh: add --index-strategy=unsafe-best-match for CPU builds so uv
  can mix the PyTorch index and PyPI for transitive deps (matches the
  existing intel profile behaviour).
- backend.py LoadModel: vllm >= 0.14 removed AsyncLLMEngine.get_model_config
  so the old code path errored out with AttributeError on model load.
  Switch to the new get_tokenizer()/tokenizer accessor with a fallback
  to building the tokenizer directly from request.Model.

* fix(vllm): tool parser constructor compat + e2e tool calling test

Concrete vLLM tool parsers override the abstract base's __init__ and
drop the tools kwarg (e.g. Hermes2ProToolParser only takes tokenizer).
Instantiating with tools= raised TypeError which was silently caught,
leaving chat_deltas.tool_calls empty.

Retry the constructor without the tools kwarg on TypeError — tools
aren't required by these parsers since extract_tool_calls finds tool
syntax in the raw model output directly.

Validated with Qwen/Qwen2.5-0.5B-Instruct + hermes parser on CPU:
the backend correctly returns ToolCallDelta{name='get_weather',
arguments='{"location": "Paris, France"}'} in ChatDelta.

test_tool_calls.py is a standalone smoke test that spawns the gRPC
backend, sends a chat completion with tools, and asserts the response
contains a structured tool call.

* ci(backend): build cpu-vllm container image

Add the cpu-vllm variant to the backend container build matrix so the
image registered in backend/index.yaml (cpu-vllm / cpu-vllm-development)
is actually produced by CI.

Follows the same pattern as the other CPU python backends
(cpu-diffusers, cpu-chatterbox, etc.) with build-type='' and no CUDA.
backend_pr.yml auto-picks this up via its matrix filter from backend.yml.

* test(e2e-backends): add tools capability + HF model name support

Extends tests/e2e-backends to cover backends that:
- Resolve HuggingFace model ids natively (vllm, vllm-omni) instead of
  loading a local file: BACKEND_TEST_MODEL_NAME is passed verbatim as
  ModelOptions.Model with no download/ModelFile.
- Parse tool calls into ChatDelta.tool_calls: new "tools" capability
  sends a Predict with a get_weather function definition and asserts
  the Reply contains a matching ToolCallDelta. Uses UseTokenizerTemplate
  with OpenAI-style Messages so the backend can wire tools into the
  model's chat template.
- Need backend-specific Options[]: BACKEND_TEST_OPTIONS lets a test set
  e.g. "tool_parser:hermes,reasoning_parser:qwen3" at LoadModel time.

Adds make target test-extra-backend-vllm that:
- docker-build-vllm
- loads Qwen/Qwen2.5-0.5B-Instruct
- runs health,load,predict,stream,tools with tool_parser:hermes

Drops backend/python/vllm/test_{cpu_inference,tool_calls}.py — those
standalone scripts were scaffolding used while bringing up the Python
backend; the e2e-backends harness now covers the same ground uniformly
alongside llama-cpp and ik-llama-cpp.

* ci(test-extra): run vllm e2e tests on CPU

Adds tests-vllm-grpc to the test-extra workflow, mirroring the
llama-cpp and ik-llama-cpp gRPC jobs. Triggers when files under
backend/python/vllm/ change (or on run-all), builds the local-ai
vllm container image, and runs the tests/e2e-backends harness with
BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct, tool_parser:hermes,
and the tools capability enabled.

Uses ubuntu-latest (no GPU) — vllm runs on CPU via the cpu-vllm
wheel we pinned in requirements-cpu-after.txt. Frees disk space
before the build since the docker image + torch + vllm wheel is
sizeable.

* fix(vllm): build from source on CI to avoid SIGILL on prebuilt wheel

The prebuilt vllm 0.14.1+cpu wheel from GitHub releases is compiled with
SIMD instructions (AVX-512 VNNI/BF16 or AMX-BF16) that not every CPU
supports. GitHub Actions ubuntu-latest runners SIGILL when vllm spawns
the model_executor.models.registry subprocess for introspection, so
LoadModel never reaches the actual inference path.

- install.sh: when FROM_SOURCE=true on a CPU build, temporarily hide
  requirements-cpu-after.txt so installRequirements installs the base
  deps + torch CPU without pulling the prebuilt wheel, then clone vllm
  and compile it with VLLM_TARGET_DEVICE=cpu. The resulting binaries
  target the host's actual CPU.
- backend/Dockerfile.python: accept a FROM_SOURCE build-arg and expose
  it as an ENV so install.sh sees it during `make`.
- Makefile docker-build-backend: forward FROM_SOURCE as --build-arg
  when set, so backends that need source builds can opt in.
- Makefile test-extra-backend-vllm: call docker-build-vllm via a
  recursive $(MAKE) invocation so FROM_SOURCE flows through.
- .github/workflows/test-extra.yml: set FROM_SOURCE=true on the
  tests-vllm-grpc job. Slower but reliable — the prebuilt wheel only
  works on hosts that share the build-time SIMD baseline.

Answers 'did you test locally?': yes, end-to-end on my local machine
with the prebuilt wheel (CPU supports AVX-512 VNNI). The CI runner CPU
gap was not covered locally — this commit plugs that gap.

* ci(vllm): use bigger-runner instead of source build

The prebuilt vllm 0.14.1+cpu wheel requires SIMD instructions (AVX-512
VNNI/BF16) that stock ubuntu-latest GitHub runners don't support —
vllm.model_executor.models.registry SIGILLs on import during LoadModel.

Source compilation works but takes 30-40 minutes per CI run, which is
too slow for an e2e smoke test. Instead, switch tests-vllm-grpc to the
bigger-runner self-hosted label (already used by backend.yml for the
llama-cpp CUDA build) — that hardware has the required SIMD baseline
and the prebuilt wheel runs cleanly.

FROM_SOURCE=true is kept as an opt-in escape hatch:
- install.sh still has the CPU source-build path for hosts that need it
- backend/Dockerfile.python still declares the ARG + ENV
- Makefile docker-build-backend still forwards the build-arg when set
Default CI path uses the fast prebuilt wheel; source build can be
re-enabled by exporting FROM_SOURCE=true in the environment.

* ci(vllm): install make + build deps on bigger-runner

bigger-runner is a bare self-hosted runner used by backend.yml for
docker image builds — it has docker but not the usual ubuntu-latest
toolchain. The make-based test target needs make, build-essential
(cgo in 'go test'), and curl/unzip (the Makefile protoc target
downloads protoc from github releases).

protoc-gen-go and protoc-gen-go-grpc come via 'go install' in the
install-go-tools target, which setup-go makes possible.

* ci(vllm): install libnuma1 + libgomp1 on bigger-runner

The vllm 0.14.1+cpu wheel ships a _C C++ extension that dlopens
libnuma.so.1 at import time. When the runner host doesn't have it,
the extension silently fails to register its torch ops, so
EngineCore crashes on init_device with:

  AttributeError: '_OpNamespace' '_C_utils' object has no attribute
    'init_cpu_threads_env'

Also add libgomp1 (OpenMP runtime, used by torch CPU kernels) to be
safe on stripped-down runners.

* feat(vllm): bundle libnuma/libgomp via package.sh

The vllm CPU wheel ships a _C extension that dlopens libnuma.so.1 at
import time; torch's CPU kernels in turn use libgomp.so.1 (OpenMP).
Without these on the host, vllm._C silently fails to register its
torch ops and EngineCore crashes with:

  AttributeError: '_OpNamespace' '_C_utils' object has no attribute
    'init_cpu_threads_env'

Rather than asking every user to install libnuma1/libgomp1 on their
host (or every LocalAI base image to ship them), bundle them into
the backend image itself — same pattern fish-speech and the GPU libs
already use. libbackend.sh adds ${EDIR}/lib to LD_LIBRARY_PATH at
run time so the bundled copies are picked up automatically.

- backend/python/vllm/package.sh (new): copies libnuma.so.1 and
  libgomp.so.1 from the builder's multilib paths into ${BACKEND}/lib,
  preserving soname symlinks. Runs during Dockerfile.python's
  'Run backend-specific packaging' step (which already invokes
  package.sh if present).
- backend/Dockerfile.python: install libnuma1 + libgomp1 in the
  builder stage so package.sh has something to copy (the Ubuntu
  base image otherwise only has libgomp in the gcc dep chain).
- test-extra.yml: drop the workaround that installed these libs on
  the runner host — with the backend image self-contained, the
  runner no longer needs them, and the test now exercises the
  packaging path end-to-end the way a production host would.

* ci(vllm): disable tests-vllm-grpc job (heterogeneous runners)

Both ubuntu-latest and bigger-runner have inconsistent CPU baselines:
some instances support the AVX-512 VNNI/BF16 instructions the prebuilt
vllm 0.14.1+cpu wheel was compiled with, others SIGILL on import of
vllm.model_executor.models.registry. The libnuma packaging fix doesn't
help when the wheel itself can't be loaded.

FROM_SOURCE=true compiles vllm against the actual host CPU and works
everywhere, but takes 30-50 minutes per run — too slow for a smoke
test on every PR.

Comment out the job for now. The test itself is intact and passes
locally; run it via 'make test-extra-backend-vllm' on a host with the
required SIMD baseline. Re-enable when:
  - we have a self-hosted runner label with guaranteed AVX-512 VNNI/BF16, or
  - vllm publishes a CPU wheel with a wider baseline, or
  - we set up a docker layer cache that makes FROM_SOURCE acceptable

The detect-changes vllm output, the test harness changes (tests/
e2e-backends + tools cap), the make target (test-extra-backend-vllm),
the package.sh and the Dockerfile/install.sh plumbing all stay in
place.
2026-04-13 11:00:29 +02:00
Ettore Di Giacinto
2865f0f8d3 feat(ux): backend management enhancement (#9325)
* feat: add PreferDevelopmentBackends setting, expose isMeta/isDevelopment in API

- Add PreferDevelopmentBackends config field, CLI flag, runtime setting
- Add IsDevelopment() method to GalleryBackend
- Use AvailableBackendsUnfiltered in UI API to show all backends
- Expose isMeta, isDevelopment, preferDevelopmentBackends in backend API response

* feat: upgrade banner with Upgrade All button, detect pre-existing backends

- Add upgrade banner on Backends page showing count and Upgrade All button
- Fix upgrade detection for backends installed before version tracking:
  flag as upgradeable when gallery has a version but installed has none
- Fix OCI digest check to flag backends with no stored digest as upgradeable
2026-04-12 00:35:22 +02:00
Ettore Di Giacinto
8ab0744458 feat: backend versioning, upgrade detection and auto-upgrade (#9315)
* feat: add backend versioning data model foundation

Add Version, URI, and Digest fields to BackendMetadata for tracking
installed backend versions and enabling upgrade detection. Add Version
field to GalleryBackend. Add UpgradeAvailable/AvailableVersion fields
to SystemBackend. Implement GetImageDigest() for lightweight OCI digest
lookups via remote.Head. Record version, URI, and digest at install time
in InstallBackend() and propagate version through meta backends.

* feat: add backend upgrade detection and execution logic

Add CheckBackendUpgrades() to compare installed backend versions/digests
against gallery entries, and UpgradeBackend() to perform atomic upgrades
with backup-based rollback on failure. Includes Agent A's data model
changes (Version/URI/Digest fields, GetImageDigest).

* feat: add AutoUpgradeBackends config and runtime settings

Add configuration and runtime settings for backend auto-upgrade:
- RuntimeSettings field for dynamic config via API/JSON
- ApplicationConfig field, option func, and roundtrip conversion
- CLI flag with LOCALAI_AUTO_UPGRADE_BACKENDS env var
- Config file watcher support for runtime_settings.json
- Tests for ToRuntimeSettings, ApplyRuntimeSettings, and roundtrip

* feat(ui): add backend version display and upgrade support

- Add upgrade check/trigger API endpoints to config and api module
- Backends page: version badge, upgrade indicator, upgrade button
- Manage page: version in metadata, context-aware upgrade/reinstall button
- Settings page: auto-upgrade backends toggle

* feat: add upgrade checker service, API endpoints, and CLI command

- UpgradeChecker background service: checks every 6h, auto-upgrades when enabled
- API endpoints: GET /backends/upgrades, POST /backends/upgrades/check, POST /backends/upgrade/:name
- CLI: `localai backends upgrade` command, version display in `backends list`
- BackendManager interface: add UpgradeBackend and CheckUpgrades methods
- Wire upgrade op through GalleryService backend handler
- Distributed mode: fan-out upgrade to worker nodes via NATS

* fix: use advisory lock for upgrade checker in distributed mode

In distributed mode with multiple frontend instances, use PostgreSQL
advisory lock (KeyBackendUpgradeCheck) so only one instance runs
periodic upgrade checks and auto-upgrades. Prevents duplicate
upgrade operations across replicas.

Standalone mode is unchanged (simple ticker loop).

* test: add e2e tests for backend upgrade API

- Test GET /api/backends/upgrades returns 200 (even with no upgrade checker)
- Test POST /api/backends/upgrade/:name accepts request and returns job ID
- Test full upgrade flow: trigger upgrade via API, wait for job completion,
  verify run.sh updated to v2 and metadata.json has version 2.0.0
- Test POST /api/backends/upgrades/check returns 200
- Fix nil check for applicationInstance in upgrade API routes
2026-04-11 22:31:15 +02:00
Ettore Di Giacinto
5c35e85fe2 feat: allow to pin models and skip from reaping (#9309)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-11 08:38:17 +02:00
Leigh Phillips
062e0d0d00 feat: Add toggle mechanism to enable/disable models from loading on demand (#9304)
* feat: add toggle mechanism to enable/disable models from loading on demand

Implements #9303 - Adds ability to disable models from being auto-loaded
while keeping them in the collection.

Backend changes:
- Add Disabled field to ModelConfig struct with IsDisabled() getter
- New ToggleModelEndpoint handler (PUT /models/toggle/:name/:action)
- Request middleware returns 403 when disabled model is requested
- Capabilities endpoint exposes disabled status

Frontend changes:
- Toggle switch in System > Models table Actions column
- Visual indicators: dimmed row, red Disabled badge, muted icons
- Tooltip describes toggle function on hover
- Loading state while API call is in progress

* fix: remove extra closing brace causing syntax error in request middleware

* refactor: reorder Actions column - Stop button before toggle switch

* refactor: migrate from toggle to toggle-state per PR review feedback
2026-04-10 18:17:41 +02:00
Ettore Di Giacinto
9748a1cbc6 fix(streaming): skip chat deltas for role-init elements to prevent first token duplication (#9299)
When TASK_RESPONSE_TYPE_OAI_CHAT is used, the first streaming token
produces a JSON array with two elements: a role-init chunk and the
actual content chunk. The grpc-server loop called attach_chat_deltas
for both elements with the same raw_result pointer, stamping the first
token's ChatDelta.Content on both replies. The Go side accumulated both,
emitting the first content token twice to SSE clients.

Fix: in the array iteration loops in PredictStream, detect role-init
elements (delta has "role" key) and skip attach_chat_deltas for them.
Only content/reasoning elements get chat deltas attached.

Reasoning models are unaffected because their first token goes into
reasoning_content, not content.
2026-04-10 08:45:47 +02:00
Ettore Di Giacinto
e1a6010874 fix(streaming): deduplicate tool call emissions during streaming (#9292)
The Go-side incremental JSON parser was emitting the same tool call on
every streaming token because it lacked the len > lastEmittedCount guard
that the XML parser had. On top of that, the post-streaming default:
case re-emitted all tool calls from index 0, duplicating everything.

This produced duplicate delta.tool_calls events causing clients to
accumulate arguments as "{args}{args}" — invalid JSON.

Fixes:
- JSON incremental parser: add len(jsonResults) > lastEmittedCount guard
  and loop from lastEmittedCount (matching the XML parser pattern)
- Post-streaming default: case: skip i < lastEmittedCount entries that
  were already emitted during streaming
- JSON parser: use blocking channel send (matching XML parser behavior)
2026-04-10 00:44:25 +02:00
Ettore Di Giacinto
706cf5d43c feat(sam.cpp): add sam.cpp detection backend (#9288)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-09 21:49:11 +02:00
Ettore Di Giacinto
13a6ed709c fix: thinking models with tools returning empty content (reasoning-only retry loop) (#9290)
When clients like Nextcloud or Home Assistant send requests with tools
to thinking models (e.g. Gemma 4 with <|channel>thought tags), the
response was empty despite the backend producing valid content.

Root cause: the C++ autoparser puts clean content in both the raw
Response and ChatDeltas. The Go-side PrependThinkingTokenIfNeeded
then prepends the thinking start token to the already-clean content,
causing ExtractReasoning to classify the entire response as unclosed
reasoning. This made cbRawResult empty, triggering a retry loop that
never succeeds.

Two fixes:
- inference.go: check ChatDeltas for content/tool_calls regardless of
  whether Response is empty, so skipCallerRetry fires correctly
- chat.go: when ChatDeltas have content but no tool calls, use that
  content directly instead of falling back to the empty cbRawResult
2026-04-09 18:30:31 +02:00
Ettore Di Giacinto
85be4ff03c feat(api): add ollama compatibility (#9284)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-09 14:15:14 +02:00
Ettore Di Giacinto
e00ce981f0 fix: try to add whisperx and faster-whisper for more variants (#9278)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-08 21:23:38 +02:00
Ettore Di Giacinto
39c6b3ed66 feat: track files being staged (#9275)
This changeset makes visible when files are being staged, so users are
aware that the model "isn't ready yet" for requests.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-08 14:33:58 +02:00
Ettore Di Giacinto
0e9d1a6588 chore(ci): drop unnecessary test
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-08 12:19:54 +00:00
Ettore Di Giacinto
154fa000d3 fix(autoscaling): extract load model from Route() and use as well when doing autoscale (#9270)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-08 08:27:51 +02:00
Richard Palethorpe
9ac1bdc587 feat(ui): Interactive model config editor with autocomplete (#9149)
* feat(ui): Add dynamic model editor with autocomplete

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* chore(docs): Add link to longformat installation video

Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-07 14:42:23 +02:00
Ettore Di Giacinto
505c417fa7 fix(gpu): better detection for MacOS and Thor (#9263)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-07 00:39:07 +02:00
Ettore Di Giacinto
0f9d516a6c fix(anthropic): do not emit empty tokens and fix SSE tool calls (#9258)
This fixes Claude Code compatibility

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-07 00:38:21 +02:00
Ettore Di Giacinto
92f99b1ec3 fix(token): login via legacy api keys (#9249)
We were not checking against the api keys when db == nil.

This commit also cleanups now unused middleware

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-06 21:45:09 +02:00
Ettore Di Giacinto
773489eeb1 fix(chat): do not retry if we had chatdeltas or tooldeltas from backend (#9244)
* fix(chat): do not retry if we had chatdeltas or tooldeltas from backend

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix: use oai compat for llama.cpp

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix: apply to non-streaming path too

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* map also other fields

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-06 10:52:23 +02:00
Ettore Di Giacinto
232e324a68 fix(autoparser): correctly pass by logprobs (#9239)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-05 09:39:22 +02:00
Ettore Di Giacinto
53deeb1107 fix(reasoning): suppress partial tag tokens during autoparser warm-up
The C++ PEG parser needs a few tokens to identify the reasoning format
(e.g. "<|channel>thought\n" for Gemma 4). During this warm-up, the gRPC
layer was sending raw partial tag tokens to Go, which leaked into the
reasoning field.

- Clear reply.message in gRPC when autoparser is active but has no diffs
  yet, matching llama.cpp server behavior of only emitting classified output
- Prefer C++ autoparser chat deltas for reasoning/content in all streaming
  paths, falling back to Go-side extraction for backends without autoparser
  (e.g. vLLM)
- Override non-streaming no-tools result with chat delta content when available
- Guard PrependThinkingTokenIfNeeded against partial tag prefixes during
  streaming accumulation
- Reorder default thinking tokens so <|channel>thought is checked before
  <|think|> (Gemma 4 templates contain both)
2026-04-04 20:45:57 +00:00
Ettore Di Giacinto
c5a840f6af fix(reasoning): warm-up
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-04 20:25:24 +00:00
Ettore Di Giacinto
6d9d77d590 fix(reasoning): accumulate and strip reasoning tags from autoparser results (#9227)
fix(reasoning): acccumulate and strip reasoning tags from autoparser results

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-04 18:15:32 +02:00
Ettore Di Giacinto
6f304d1201 chore(refactor): use interface (#9226)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-04 17:29:37 +02:00
Richard Palethorpe
557d0f0f04 feat(api): Allow coding agents to interactively discover how to control and configure LocalAI (#9084)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-04 15:14:35 +02:00
Ettore Di Giacinto
b7e3589875 fix(anthropic): show null index when not present, default to 0 (#9225)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-04 15:13:17 +02:00
Ettore Di Giacinto
716ddd697b feat(autoparser): prefer chat deltas from backends when emitted (#9224)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-04 12:12:08 +02:00
Ettore Di Giacinto
223deb908d fix(nats): improve error handling (#9222)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-04 12:11:54 +02:00
Ettore Di Giacinto
9f8821bba8 feat(gemma4): add thinking support (#9221)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-04 12:11:38 +02:00
Ettore Di Giacinto
84e51b68ef fix(ui): pass by staticApiKeyRequired to show login when only api key is configured (#9220)
This fixes #9213

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-04 12:11:22 +02:00
github-actions[bot]
57c0026715 chore: bump inference defaults from unsloth (#9219)
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-04 09:44:12 +02:00
Ettore Di Giacinto
6c635e8353 feat: add resume endpoint to undrain nodes (#9197)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-01 18:21:43 +02:00
Ettore Di Giacinto
6b6c136210 fix(inflight): count inflight from load model, but release afterwards (#9194)
This should fix the count of 1 in flight always showing in the node list

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-31 23:24:45 +02:00
Ettore Di Giacinto
e587ecc485 chore(ui): allow to unload forcefully
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-31 17:20:53 +00:00
Ettore Di Giacinto
221ff0f28f feat(ui): show cluster status in home in distributed mode
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-31 15:37:58 +00:00
Ettore Di Giacinto
16d5cb00bd chore: css cleanups
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-31 16:37:38 +02:00
Richard Palethorpe
952635fba6 feat(distributed): Avoid resending models to backend nodes (#9193)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-03-31 16:28:13 +02:00
Ettore Di Giacinto
3cc05af2e5 chore(nodes): restore offline nodes too
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-31 14:22:18 +00:00
Richard Palethorpe
efdcbbe332 feat(api): Return 404 when model is not found except for model names in HF format (#9133)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-31 10:48:21 +02:00
Ettore Di Giacinto
b4fff9293d chore: small ui improvements in the node page
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-31 08:41:40 +00:00
Ettore Di Giacinto
3db12eaa7a fix(oauth/invite): do not register user (prending approval) without correct invite (#9189)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-31 08:29:07 +02:00
Ettore Di Giacinto
8862e3ce60 feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler (#9186)
* always enable parallel requests

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: move tests to ginkgo

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(smart router): order by available vram

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-31 08:28:56 +02:00
Ettore Di Giacinto
dd3376e0a9 chore(workers): improve logging, set header timeouts (#9171)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-30 17:26:55 +02:00
Richard Palethorpe
c2f7d1c18b feat(ui): Add media history to studio pages (e.g. past images) (#9151)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-30 00:49:55 +02:00