Commit Graph

161 Commits

Author SHA1 Message Date
leinasi2014
486b5e25a3 fix(config): ignore yaml backup files in model loader (#9443)
Only load files whose real extension is .yaml or .yml so backup files like model.yaml.bak do not override active configs. Add a regression test covering plain and timestamped backup files.

Assisted-by: Codex:gpt-5.4 docker

Signed-off-by: leinasi2014 <leinasi2014@gmail.com>
2026-04-20 23:41:39 +02:00
Ettore Di Giacinto
7809c5f5d0 fix(vision): propagate mtmd media marker from backend via ModelMetadata (#9412)
Upstream llama.cpp (PR #21962) switched the server-side mtmd media
marker to a random per-server string and removed the legacy
"<__media__>" backward-compat replacement in mtmd_tokenizer. The
Go layer still emitted the hardcoded "<__media__>", so on the
non-tokenizer-template path the prompt arrived with a marker mtmd
did not recognize and tokenization failed with "number of bitmaps
(1) does not match number of markers (0)".

Report the active media marker via ModelMetadataResponse.media_marker
and substitute the sentinel "<__media__>" with it right before the
gRPC call, after the backend has been loaded and probed. Also skip
the Go-side multimodal templating entirely when UseTokenizerTemplate
is true — llama.cpp's oaicompat_chat_params_parse already injects its
own marker and StringContent is unused in that path. Backends that do
not expose the field keep the legacy "<__media__>" behavior.
2026-04-18 20:30:13 +02:00
github-actions[bot]
48e87db400 chore: bump inference defaults from unsloth (#9396)
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-17 09:05:55 +02:00
Ettore Di Giacinto
d67623230f feat(vllm): parity with llama.cpp backend (#9328)
* fix(schema): serialize ToolCallID and Reasoning in Messages.ToProto

The ToProto conversion was dropping tool_call_id and reasoning_content
even though both proto and Go fields existed, breaking multi-turn tool
calling and reasoning passthrough to backends.

* refactor(config): introduce backend hook system and migrate llama-cpp defaults

Adds RegisterBackendHook/runBackendHooks so each backend can register
default-filling functions that run during ModelConfig.SetDefaults().

Migrates the existing GGUF guessing logic into hooks_llamacpp.go,
registered for both 'llama-cpp' and the empty backend (auto-detect).
Removes the old guesser.go shim.

* feat(config): add vLLM parser defaults hook and importer auto-detection

Introduces parser_defaults.json mapping model families to vLLM
tool_parser/reasoning_parser names, with longest-pattern-first matching.

The vllmDefaults hook auto-fills tool_parser and reasoning_parser
options at load time for known families, while the VLLMImporter writes
the same values into generated YAML so users can review and edit them.

Adds tests covering MatchParserDefaults, hook registration via
SetDefaults, and the user-override behavior.

* feat(vllm): wire native tool/reasoning parsers + chat deltas + logprobs

- Use vLLM's ToolParserManager/ReasoningParserManager to extract structured
  output (tool calls, reasoning content) instead of reimplementing parsing
- Convert proto Messages to dicts and pass tools to apply_chat_template
- Emit ChatDelta with content/reasoning_content/tool_calls in Reply
- Extract prompt_tokens, completion_tokens, and logprobs from output
- Replace boolean GuidedDecoding with proper GuidedDecodingParams from Grammar
- Add TokenizeString and Free RPC methods
- Fix missing `time` import used by load_video()

* feat(vllm): CPU support + shared utils + vllm-omni feature parity

- Split vllm install per acceleration: move generic `vllm` out of
  requirements-after.txt into per-profile after files (cublas12, hipblas,
  intel) and add CPU wheel URL for cpu-after.txt
- requirements-cpu.txt now pulls torch==2.7.0+cpu from PyTorch CPU index
- backend/index.yaml: register cpu-vllm / cpu-vllm-development variants
- New backend/python/common/vllm_utils.py: shared parse_options,
  messages_to_dicts, setup_parsers helpers (used by both vllm backends)
- vllm-omni: replace hardcoded chat template with tokenizer.apply_chat_template,
  wire native parsers via shared utils, emit ChatDelta with token counts,
  add TokenizeString and Free RPCs, detect CPU and set VLLM_TARGET_DEVICE
- Add test_cpu_inference.py: standalone script to validate CPU build with
  a small model (Qwen2.5-0.5B-Instruct)

* fix(vllm): CPU build compatibility with vllm 0.14.1

Validated end-to-end on CPU with Qwen2.5-0.5B-Instruct (LoadModel, Predict,
TokenizeString, Free all working).

- requirements-cpu-after.txt: pin vllm to 0.14.1+cpu (pre-built wheel from
  GitHub releases) for x86_64 and aarch64. vllm 0.14.1 is the newest CPU
  wheel whose torch dependency resolves against published PyTorch builds
  (torch==2.9.1+cpu). Later vllm CPU wheels currently require
  torch==2.10.0+cpu which is only available on the PyTorch test channel
  with incompatible torchvision.
- requirements-cpu.txt: bump torch to 2.9.1+cpu, add torchvision/torchaudio
  so uv resolves them consistently from the PyTorch CPU index.
- install.sh: add --index-strategy=unsafe-best-match for CPU builds so uv
  can mix the PyTorch index and PyPI for transitive deps (matches the
  existing intel profile behaviour).
- backend.py LoadModel: vllm >= 0.14 removed AsyncLLMEngine.get_model_config
  so the old code path errored out with AttributeError on model load.
  Switch to the new get_tokenizer()/tokenizer accessor with a fallback
  to building the tokenizer directly from request.Model.

* fix(vllm): tool parser constructor compat + e2e tool calling test

Concrete vLLM tool parsers override the abstract base's __init__ and
drop the tools kwarg (e.g. Hermes2ProToolParser only takes tokenizer).
Instantiating with tools= raised TypeError which was silently caught,
leaving chat_deltas.tool_calls empty.

Retry the constructor without the tools kwarg on TypeError — tools
aren't required by these parsers since extract_tool_calls finds tool
syntax in the raw model output directly.

Validated with Qwen/Qwen2.5-0.5B-Instruct + hermes parser on CPU:
the backend correctly returns ToolCallDelta{name='get_weather',
arguments='{"location": "Paris, France"}'} in ChatDelta.

test_tool_calls.py is a standalone smoke test that spawns the gRPC
backend, sends a chat completion with tools, and asserts the response
contains a structured tool call.

* ci(backend): build cpu-vllm container image

Add the cpu-vllm variant to the backend container build matrix so the
image registered in backend/index.yaml (cpu-vllm / cpu-vllm-development)
is actually produced by CI.

Follows the same pattern as the other CPU python backends
(cpu-diffusers, cpu-chatterbox, etc.) with build-type='' and no CUDA.
backend_pr.yml auto-picks this up via its matrix filter from backend.yml.

* test(e2e-backends): add tools capability + HF model name support

Extends tests/e2e-backends to cover backends that:
- Resolve HuggingFace model ids natively (vllm, vllm-omni) instead of
  loading a local file: BACKEND_TEST_MODEL_NAME is passed verbatim as
  ModelOptions.Model with no download/ModelFile.
- Parse tool calls into ChatDelta.tool_calls: new "tools" capability
  sends a Predict with a get_weather function definition and asserts
  the Reply contains a matching ToolCallDelta. Uses UseTokenizerTemplate
  with OpenAI-style Messages so the backend can wire tools into the
  model's chat template.
- Need backend-specific Options[]: BACKEND_TEST_OPTIONS lets a test set
  e.g. "tool_parser:hermes,reasoning_parser:qwen3" at LoadModel time.

Adds make target test-extra-backend-vllm that:
- docker-build-vllm
- loads Qwen/Qwen2.5-0.5B-Instruct
- runs health,load,predict,stream,tools with tool_parser:hermes

Drops backend/python/vllm/test_{cpu_inference,tool_calls}.py — those
standalone scripts were scaffolding used while bringing up the Python
backend; the e2e-backends harness now covers the same ground uniformly
alongside llama-cpp and ik-llama-cpp.

* ci(test-extra): run vllm e2e tests on CPU

Adds tests-vllm-grpc to the test-extra workflow, mirroring the
llama-cpp and ik-llama-cpp gRPC jobs. Triggers when files under
backend/python/vllm/ change (or on run-all), builds the local-ai
vllm container image, and runs the tests/e2e-backends harness with
BACKEND_TEST_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct, tool_parser:hermes,
and the tools capability enabled.

Uses ubuntu-latest (no GPU) — vllm runs on CPU via the cpu-vllm
wheel we pinned in requirements-cpu-after.txt. Frees disk space
before the build since the docker image + torch + vllm wheel is
sizeable.

* fix(vllm): build from source on CI to avoid SIGILL on prebuilt wheel

The prebuilt vllm 0.14.1+cpu wheel from GitHub releases is compiled with
SIMD instructions (AVX-512 VNNI/BF16 or AMX-BF16) that not every CPU
supports. GitHub Actions ubuntu-latest runners SIGILL when vllm spawns
the model_executor.models.registry subprocess for introspection, so
LoadModel never reaches the actual inference path.

- install.sh: when FROM_SOURCE=true on a CPU build, temporarily hide
  requirements-cpu-after.txt so installRequirements installs the base
  deps + torch CPU without pulling the prebuilt wheel, then clone vllm
  and compile it with VLLM_TARGET_DEVICE=cpu. The resulting binaries
  target the host's actual CPU.
- backend/Dockerfile.python: accept a FROM_SOURCE build-arg and expose
  it as an ENV so install.sh sees it during `make`.
- Makefile docker-build-backend: forward FROM_SOURCE as --build-arg
  when set, so backends that need source builds can opt in.
- Makefile test-extra-backend-vllm: call docker-build-vllm via a
  recursive $(MAKE) invocation so FROM_SOURCE flows through.
- .github/workflows/test-extra.yml: set FROM_SOURCE=true on the
  tests-vllm-grpc job. Slower but reliable — the prebuilt wheel only
  works on hosts that share the build-time SIMD baseline.

Answers 'did you test locally?': yes, end-to-end on my local machine
with the prebuilt wheel (CPU supports AVX-512 VNNI). The CI runner CPU
gap was not covered locally — this commit plugs that gap.

* ci(vllm): use bigger-runner instead of source build

The prebuilt vllm 0.14.1+cpu wheel requires SIMD instructions (AVX-512
VNNI/BF16) that stock ubuntu-latest GitHub runners don't support —
vllm.model_executor.models.registry SIGILLs on import during LoadModel.

Source compilation works but takes 30-40 minutes per CI run, which is
too slow for an e2e smoke test. Instead, switch tests-vllm-grpc to the
bigger-runner self-hosted label (already used by backend.yml for the
llama-cpp CUDA build) — that hardware has the required SIMD baseline
and the prebuilt wheel runs cleanly.

FROM_SOURCE=true is kept as an opt-in escape hatch:
- install.sh still has the CPU source-build path for hosts that need it
- backend/Dockerfile.python still declares the ARG + ENV
- Makefile docker-build-backend still forwards the build-arg when set
Default CI path uses the fast prebuilt wheel; source build can be
re-enabled by exporting FROM_SOURCE=true in the environment.

* ci(vllm): install make + build deps on bigger-runner

bigger-runner is a bare self-hosted runner used by backend.yml for
docker image builds — it has docker but not the usual ubuntu-latest
toolchain. The make-based test target needs make, build-essential
(cgo in 'go test'), and curl/unzip (the Makefile protoc target
downloads protoc from github releases).

protoc-gen-go and protoc-gen-go-grpc come via 'go install' in the
install-go-tools target, which setup-go makes possible.

* ci(vllm): install libnuma1 + libgomp1 on bigger-runner

The vllm 0.14.1+cpu wheel ships a _C C++ extension that dlopens
libnuma.so.1 at import time. When the runner host doesn't have it,
the extension silently fails to register its torch ops, so
EngineCore crashes on init_device with:

  AttributeError: '_OpNamespace' '_C_utils' object has no attribute
    'init_cpu_threads_env'

Also add libgomp1 (OpenMP runtime, used by torch CPU kernels) to be
safe on stripped-down runners.

* feat(vllm): bundle libnuma/libgomp via package.sh

The vllm CPU wheel ships a _C extension that dlopens libnuma.so.1 at
import time; torch's CPU kernels in turn use libgomp.so.1 (OpenMP).
Without these on the host, vllm._C silently fails to register its
torch ops and EngineCore crashes with:

  AttributeError: '_OpNamespace' '_C_utils' object has no attribute
    'init_cpu_threads_env'

Rather than asking every user to install libnuma1/libgomp1 on their
host (or every LocalAI base image to ship them), bundle them into
the backend image itself — same pattern fish-speech and the GPU libs
already use. libbackend.sh adds ${EDIR}/lib to LD_LIBRARY_PATH at
run time so the bundled copies are picked up automatically.

- backend/python/vllm/package.sh (new): copies libnuma.so.1 and
  libgomp.so.1 from the builder's multilib paths into ${BACKEND}/lib,
  preserving soname symlinks. Runs during Dockerfile.python's
  'Run backend-specific packaging' step (which already invokes
  package.sh if present).
- backend/Dockerfile.python: install libnuma1 + libgomp1 in the
  builder stage so package.sh has something to copy (the Ubuntu
  base image otherwise only has libgomp in the gcc dep chain).
- test-extra.yml: drop the workaround that installed these libs on
  the runner host — with the backend image self-contained, the
  runner no longer needs them, and the test now exercises the
  packaging path end-to-end the way a production host would.

* ci(vllm): disable tests-vllm-grpc job (heterogeneous runners)

Both ubuntu-latest and bigger-runner have inconsistent CPU baselines:
some instances support the AVX-512 VNNI/BF16 instructions the prebuilt
vllm 0.14.1+cpu wheel was compiled with, others SIGILL on import of
vllm.model_executor.models.registry. The libnuma packaging fix doesn't
help when the wheel itself can't be loaded.

FROM_SOURCE=true compiles vllm against the actual host CPU and works
everywhere, but takes 30-50 minutes per run — too slow for a smoke
test on every PR.

Comment out the job for now. The test itself is intact and passes
locally; run it via 'make test-extra-backend-vllm' on a host with the
required SIMD baseline. Re-enable when:
  - we have a self-hosted runner label with guaranteed AVX-512 VNNI/BF16, or
  - vllm publishes a CPU wheel with a wider baseline, or
  - we set up a docker layer cache that makes FROM_SOURCE acceptable

The detect-changes vllm output, the test harness changes (tests/
e2e-backends + tools cap), the make target (test-extra-backend-vllm),
the package.sh and the Dockerfile/install.sh plumbing all stay in
place.
2026-04-13 11:00:29 +02:00
Ettore Di Giacinto
2865f0f8d3 feat(ux): backend management enhancement (#9325)
* feat: add PreferDevelopmentBackends setting, expose isMeta/isDevelopment in API

- Add PreferDevelopmentBackends config field, CLI flag, runtime setting
- Add IsDevelopment() method to GalleryBackend
- Use AvailableBackendsUnfiltered in UI API to show all backends
- Expose isMeta, isDevelopment, preferDevelopmentBackends in backend API response

* feat: upgrade banner with Upgrade All button, detect pre-existing backends

- Add upgrade banner on Backends page showing count and Upgrade All button
- Fix upgrade detection for backends installed before version tracking:
  flag as upgradeable when gallery has a version but installed has none
- Fix OCI digest check to flag backends with no stored digest as upgradeable
2026-04-12 00:35:22 +02:00
Ettore Di Giacinto
8ab0744458 feat: backend versioning, upgrade detection and auto-upgrade (#9315)
* feat: add backend versioning data model foundation

Add Version, URI, and Digest fields to BackendMetadata for tracking
installed backend versions and enabling upgrade detection. Add Version
field to GalleryBackend. Add UpgradeAvailable/AvailableVersion fields
to SystemBackend. Implement GetImageDigest() for lightweight OCI digest
lookups via remote.Head. Record version, URI, and digest at install time
in InstallBackend() and propagate version through meta backends.

* feat: add backend upgrade detection and execution logic

Add CheckBackendUpgrades() to compare installed backend versions/digests
against gallery entries, and UpgradeBackend() to perform atomic upgrades
with backup-based rollback on failure. Includes Agent A's data model
changes (Version/URI/Digest fields, GetImageDigest).

* feat: add AutoUpgradeBackends config and runtime settings

Add configuration and runtime settings for backend auto-upgrade:
- RuntimeSettings field for dynamic config via API/JSON
- ApplicationConfig field, option func, and roundtrip conversion
- CLI flag with LOCALAI_AUTO_UPGRADE_BACKENDS env var
- Config file watcher support for runtime_settings.json
- Tests for ToRuntimeSettings, ApplyRuntimeSettings, and roundtrip

* feat(ui): add backend version display and upgrade support

- Add upgrade check/trigger API endpoints to config and api module
- Backends page: version badge, upgrade indicator, upgrade button
- Manage page: version in metadata, context-aware upgrade/reinstall button
- Settings page: auto-upgrade backends toggle

* feat: add upgrade checker service, API endpoints, and CLI command

- UpgradeChecker background service: checks every 6h, auto-upgrades when enabled
- API endpoints: GET /backends/upgrades, POST /backends/upgrades/check, POST /backends/upgrade/:name
- CLI: `localai backends upgrade` command, version display in `backends list`
- BackendManager interface: add UpgradeBackend and CheckUpgrades methods
- Wire upgrade op through GalleryService backend handler
- Distributed mode: fan-out upgrade to worker nodes via NATS

* fix: use advisory lock for upgrade checker in distributed mode

In distributed mode with multiple frontend instances, use PostgreSQL
advisory lock (KeyBackendUpgradeCheck) so only one instance runs
periodic upgrade checks and auto-upgrades. Prevents duplicate
upgrade operations across replicas.

Standalone mode is unchanged (simple ticker loop).

* test: add e2e tests for backend upgrade API

- Test GET /api/backends/upgrades returns 200 (even with no upgrade checker)
- Test POST /api/backends/upgrade/:name accepts request and returns job ID
- Test full upgrade flow: trigger upgrade via API, wait for job completion,
  verify run.sh updated to v2 and metadata.json has version 2.0.0
- Test POST /api/backends/upgrades/check returns 200
- Fix nil check for applicationInstance in upgrade API routes
2026-04-11 22:31:15 +02:00
Ettore Di Giacinto
5c35e85fe2 feat: allow to pin models and skip from reaping (#9309)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-11 08:38:17 +02:00
Leigh Phillips
062e0d0d00 feat: Add toggle mechanism to enable/disable models from loading on demand (#9304)
* feat: add toggle mechanism to enable/disable models from loading on demand

Implements #9303 - Adds ability to disable models from being auto-loaded
while keeping them in the collection.

Backend changes:
- Add Disabled field to ModelConfig struct with IsDisabled() getter
- New ToggleModelEndpoint handler (PUT /models/toggle/:name/:action)
- Request middleware returns 403 when disabled model is requested
- Capabilities endpoint exposes disabled status

Frontend changes:
- Toggle switch in System > Models table Actions column
- Visual indicators: dimmed row, red Disabled badge, muted icons
- Tooltip describes toggle function on hover
- Loading state while API call is in progress

* fix: remove extra closing brace causing syntax error in request middleware

* refactor: reorder Actions column - Stop button before toggle switch

* refactor: migrate from toggle to toggle-state per PR review feedback
2026-04-10 18:17:41 +02:00
Ettore Di Giacinto
706cf5d43c feat(sam.cpp): add sam.cpp detection backend (#9288)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-09 21:49:11 +02:00
Ettore Di Giacinto
85be4ff03c feat(api): add ollama compatibility (#9284)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-09 14:15:14 +02:00
Richard Palethorpe
9ac1bdc587 feat(ui): Interactive model config editor with autocomplete (#9149)
* feat(ui): Add dynamic model editor with autocomplete

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* chore(docs): Add link to longformat installation video

Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-07 14:42:23 +02:00
Richard Palethorpe
557d0f0f04 feat(api): Allow coding agents to interactively discover how to control and configure LocalAI (#9084)
Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-04-04 15:14:35 +02:00
github-actions[bot]
57c0026715 chore: bump inference defaults from unsloth (#9219)
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-04-04 09:44:12 +02:00
Ettore Di Giacinto
8862e3ce60 feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler (#9186)
* always enable parallel requests

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: move tests to ginkgo

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(smart router): order by available vram

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-31 08:28:56 +02:00
Ettore Di Giacinto
59108fbe32 feat: add distributed mode (#9124)
* feat: add distributed mode (experimental)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix data races, mutexes, transactions

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactorings

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix events and tool stream in agent chat

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* use ginkgo

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(cron): compute correctly time boundaries avoiding re-triggering

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* enhancements, refactorings

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* do not flood of healthy checks

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* do not list obvious backends as text backends

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* tests fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* refactoring and consolidation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Drop redundant healthcheck

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* enhancements, refactorings

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-30 00:47:27 +02:00
Ettore Di Giacinto
031a36c995 feat: inferencing default, automatic tool parsing fallback and wire min_p (#9092)
* feat: wire min_p

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat: inferencing defaults

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(refactor): re-use iterative parser

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: generate automatically inference defaults from unsloth

Instead of trying to re-invent the wheel and maintain here the inference
defaults, prefer to consume unsloth ones, and contribute there as
necessary.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: apply defaults also to models installed via gallery

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: be consistent and apply fallback to all endpoint

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-22 00:57:15 +01:00
Ettore Di Giacinto
f7e8d9e791 feat(quantization): add quantization backend (#9096)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-22 00:56:34 +01:00
Ettore Di Giacinto
d9c1db2b87 feat: add (experimental) fine-tuning support with TRL (#9088)
* feat: add fine-tuning endpoint

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(experimental): add fine-tuning endpoint and TRL support

This changeset defines new GRPC signatues for Fine tuning backends, and
add TRL backend as initial fine-tuning engine. This implementation also
supports exporting to GGUF and automatically importing it to LocalAI
after fine-tuning.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* commit TRL backend, stop by killing process

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* move fine-tune to generic features

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* add evals, reorder menu

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Fix tests

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-21 02:08:02 +01:00
Richard Palethorpe
cb63bdb9e4 feat(ui): Add model pipeline editor (#9070)
This creates a new model config page. Presently just allows configuring
pipelines, but can be extending the future to other types of models.
However pipelines are quite easy to create a form for and require
editing to create.

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-20 15:07:34 +01:00
Ettore Di Giacinto
c3174f9543 chore(deps): bump llama-cpp to 'a0bbcdd9b6b83eeeda6f1216088f42c33d464e38' (#9079)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-20 08:12:21 +01:00
Ettore Di Giacinto
aea21951a2 feat: add users and authentication support (#9061)
* feat(ui): add users and authentication support

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat: allow the admin user to impersonificate users

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: ui improvements, disable 'Users' button in navbar when no auth is configured

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat: add OIDC support

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix: gate models

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: cache requests to optimize speed

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* small UI enhancements

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore(ui): style improvements

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix: cover other paths by auth

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: separate local auth, refactor

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* security hardening, approval mode

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix: fix tests and expectations

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: update localagi/localrecall

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-19 21:40:51 +01:00
Richard Palethorpe
35d509d8e7 feat(ui): Per model backend logs and various fixes (#9028)
* feat(gallery): Switch to expandable box instead of pop-over and display model files

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(ui, backends): Add individual backend logging

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(ui): Set the context settings from the model config

Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2026-03-18 08:31:26 +01:00
Ettore Di Giacinto
a738f8b0e4 feat(backends): add ace-step.cpp (#8965)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-12 18:56:26 +01:00
Ettore Di Giacinto
a026277ab9 feat(mlx-distributed): add new MLX-distributed backend (#8801)
* feat(mlx-distributed): add new MLX-distributed backend

Add new MLX distributed backend with support for both TCP and RDMA for
model sharding.

This implementation ties in the discovery implementation already in
place, and re-uses the same P2P mechanism for the TCP MLX-distributed
inferencing.

The Auto-parallel implementation is inspired by Exo's
ones (who have been added to acknowledgement for the great work!)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* expose a CLI to facilitate backend starting

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat: make manual rank0 configurable via model configs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Add missing features from mlx backend

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Apply suggestion from @mudler

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-03-09 17:29:32 +01:00
LocalAI [bot]
d200401e86 feat: Add --data-path CLI flag for persistent data separation (#8888)
feat: add --data-path CLI flag for persistent data separation

- Add LOCALAI_DATA_PATH environment variable and --data-path CLI flag
- Default data path: /data (separate from configuration directory)
- Automatic migration on startup: moves agent_tasks.json, agent_jobs.json, collections/, and assets/ from old config dir to new data path
- Backward compatible: preserves old behavior if LOCALAI_DATA_PATH is not set
- Agent state and job directories now use DataPath with proper fallback chain
- Update documentation with new flag and docker-compose example

This separates mutable persistent data (collectiondb, agents, assets, skills) from configuration files, enabling better volume mounting and data persistence in containerized deployments.

Signed-off-by: localai-bot <localai-bot@noreply.github.com>
Co-authored-by: localai-bot <localai-bot@noreply.github.com>
2026-03-09 14:11:15 +01:00
Ettore Di Giacinto
b2f81bfa2e feat(functions): add peg-based parsing and allow backends to return tool calls directly (#8838)
* feat(functions): add peg-based parsing

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat: support returning toolcalls directly from backends

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: do run PEG only if backend didn't send deltas

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-08 22:21:57 +01:00
Ettore Di Giacinto
ac48867b7d feat: add agentic management (#8820)
* feat: add standalone and agentic functionalities

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* expose agents via responses api

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-03-07 00:03:08 +01:00
LocalAI [bot]
ab315f2725 feat: Add LOCALAI_DISABLE_MCP environment variable to disable MCP support (#8816)
* feat: Add LOCALAI_DISABLE_MCP environment variable to disable MCP support

- Added DisableMCP field to RunCMD struct in core/cli/run.go
- Added LOCALAI_DISABLE_MCP environment variable support
- Added DisableMCP field to ApplicationConfig struct
- Added DisableMCP AppOption function
- Updated MCP endpoint routing to check appConfig.DisableMCP
- When LOCALAI_DISABLE_MCP is set to true/1/yes, MCP endpoints are not registered

When set, all MCP functionality is disabled and appropriate error messages
are returned to users.

Use Cases:
- Security-conscious deployments where MCP is not needed
- Reducing attack surface
- Compliance requirements that prohibit certain protocol support

Environment variable: LOCALAI_DISABLE_MCP=true

Signed-off-by: localai-bot <localai-bot@users.noreply.github.com>

* docs: Add documentation for LOCALAI_DISABLE_MCP environment variable

- Add section explaining how to disable MCP support using environment variable
- Document use cases for disabling MCP
- Provide examples for CLI and Docker usage

Signed-off-by: localai-bot <localai-bot@users.noreply.github.com>

---------

Signed-off-by: localai-bot <localai-bot@users.noreply.github.com>
Co-authored-by: localai-bot <localai-bot@users.noreply.github.com>
2026-03-06 20:44:03 +01:00
LocalAI [bot]
9fc77909e0 fix: Add vllm-omni backend to video generation model detection (#8659) (#8781)
fix: Add vllm-omni backend to video generation model detection

- Include vllm-omni in the list of backends that support FLAG_VIDEO
- This allows models like vllm-omni-wan2.2-t2v to appear in the video model selector UI
- Fixes issue #8659 where video generation models using vllm-omni backend were not showing in the dropdown

Co-authored-by: team-coding-agent-1 <team-coding-agent-1@localai.dev>
2026-03-05 01:04:47 +01:00
LocalAI [bot]
6d182281cf fix: allow reranking models configured with known_usecases (#8681)
When a model is configured with 'known_usecases: [rerank]' in the YAML
config, the reranking endpoint was not being matched because:
1. The GuessUsecases function only checked for backend == 'rerankers'
2. The syncKnownUsecasesFromString() was not being called when loading
   configs via yaml.Unmarshal in readModelConfigsFromFile

This fix:
1. Updates GuessUsecases to also check if Reranking is explicitly set to
   true in the model config (in addition to checking backend type)
2. Adds syncKnownUsecasesFromString() calls after yaml.Unmarshal in
   readModelConfigsFromFile to ensure known_usecases are properly parsed

Fixes #8658

Signed-off-by: localai-bot <localai-bot@users.noreply.github.com>
Co-authored-by: localai-bot <localai-bot@users.noreply.github.com>
2026-03-02 19:00:18 +01:00
Andres
e45d63c86e fix(cli): Fix watchdog running constantly and spamming logs (#8624)
* Fix watchdog running constantly and spamming logs

Signed-off-by: Andres Smith <andressmithdev@pm.me>

* Update docs

Signed-off-by: Andres Smith <andressmithdev@pm.me>

---------

Signed-off-by: Andres Smith <andressmithdev@pm.me>
2026-02-23 11:57:28 +01:00
Ettore Di Giacinto
b471619ad9 chore(deps): bump cogito and add new options to the agent config (#8601)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-02-18 22:10:26 +01:00
Richard Palethorpe
7270a98ce5 fix(realtime): Use user provided voice and allow pipeline models to have no backend (#8415)
* fix(realtime): Use the voice provided by the user or none at all

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* fix(ui,config): Allow pipeline models to have no backend and use same validation in frontend

Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
Co-authored-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
2026-02-11 14:18:05 +01:00
Ettore Di Giacinto
53276d28e7 feat(musicgen): add ace-step and UI interface (#8396)
* feat(musicgen): add ace-step and UI interface

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Correctly handle model dir

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Drop auto-download

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Add to models, fixup UIs icons

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Update docs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* l4t13 is incompatbile

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* avoid pinning version for cuda12

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Drop l4t12

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-02-05 12:04:53 +01:00
Ettore Di Giacinto
800f749c7b fix: drop gguf VRAM estimation (now redundant) (#8325)
fix: drop gguf VRAM estimation

Cleanup. This is now handled directly in llama.cpp, no need to estimate from Go.

VRAM estimation in general is tricky, but llama.cpp ( 41ea26144e/src/llama.cpp (L168) ) lately has added an automatic "fitting" of models to VRAM, so we can drop backend-specific GGUF VRAM estimation from our code instead of trying to guess as we already enable it

 397f7f0862/backend/cpp/llama-cpp/grpc-server.cpp (L393)

Fixes: https://github.com/mudler/LocalAI/issues/8302
See: https://github.com/mudler/LocalAI/issues/8302#issuecomment-3830773472
2026-02-01 17:33:28 +01:00
Ettore Di Giacinto
26a374b717 chore: drop bark which is unmaintained (#8207)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-01-25 09:26:40 +01:00
Ettore Di Giacinto
c0b21a921b feat: detect thinking support from backend automatically if not explicitly set (#8167)
detect thinking support from backend automatically if not explicitly set

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-01-23 00:38:28 +01:00
Ettore Di Giacinto
34e054f607 fix(reasoning): support models with reasoning without starting thinking tag (#8132)
* chore: extract reasoning to its own package

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* make sure we detect thinking tokens from template

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Allow to override via config, add tests

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-01-20 21:07:59 +01:00
Ettore Di Giacinto
3387bfaee0 feat(api): add support for open responses specification (#8063)
* feat: openresponses

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Add ttl settings, fix tests

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix: register cors middleware by default

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* satisfy schema

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Logitbias and logprobs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Add grammar

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* SSE compliance

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* tool JSON conversion

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* support background mode

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* swagger

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* drop code. This is handled in the handler

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Small refactorings

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* background mode for MCP

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-01-17 22:11:47 +01:00
lif
8bd7143a44 fix: propagate validation errors (#7787)
fix: validate MCP configuration in model config

Fixes #7334

The Validate() function was not checking if MCP configuration
(mcp.stdio and mcp.remote) contains valid JSON. This caused
malformed JSON with missing commas to be silently accepted.

Changes:
- Add MCP configuration validation to ModelConfig.Validate()
- Properly report validation errors instead of discarding them
- Add test cases for valid and invalid MCP configurations

The fix ensures that malformed JSON in MCP config sections
will now be caught and reported during validation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: majiayu000 <1835304752@qq.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-30 09:54:27 +01:00
Richard Palethorpe
99b5c5f156 feat(api): Allow tracing of requests and responses (#7609)
* feat(api): Allow tracing of requests and responses

Signed-off-by: Richard Palethorpe <io@richiejp.com>

* feat(traces): Add traces UI

Signed-off-by: Richard Palethorpe <io@richiejp.com>

---------

Signed-off-by: Richard Palethorpe <io@richiejp.com>
2025-12-29 11:06:06 +01:00
Ettore Di Giacinto
c844b7ac58 feat: disable force eviction (#7725)
* feat: allow to set forcing backends eviction while requests are in flight

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat: try to make the request sit and retry if eviction couldn't be done

Otherwise calls that in order to pass would need to shutdown other
backends would just fail.

In this way instead we make the request sit and retry eviction until it
succeeds. The thresholds can be configured by the user.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* add tests

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* expose settings to CLI

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Update docs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-12-25 14:26:18 +01:00
Ettore Di Giacinto
c37785b78c chore(refactor): move logging to common package based on slog (#7668)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-12-21 19:33:13 +01:00
Ettore Di Giacinto
50f9c9a058 feat(watchdog): add Memory resource reclaimer (#7583)
* feat(watchdog): add GPU reclaimer

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Handle vram calculation for unified memory devices

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Support RAM eviction, set watchdog interval from runtime settings

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-12-16 09:15:18 +01:00
Ettore Di Giacinto
fc5b9ebfcc feat(loader): enhance single active backend to support LRU eviction (#7535)
* feat(loader): refactor single active backend support to LRU

This changeset introduces LRU management of loaded backends. Users can
set now a maximum number of models to be loaded concurrently, and, when
setting LocalAI in single active backend mode we set LRU to 1 for
backward compatibility.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* chore: add tests

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Update docs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-12-12 12:28:38 +01:00
Ettore Di Giacinto
f51d3e380b fix(config): make syncKnownUsecasesFromString idempotent (#7493)
fix(config): correctly parse usecases from strings

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-12-09 21:08:22 +01:00
Ettore Di Giacinto
6cc5cac7b0 fix(downloader): do not download model files if not necessary (#7492)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-12-09 19:08:10 +01:00
Ettore Di Giacinto
8a54ffa668 fix: do not require auth for readyz/healthz endpoints (#7403)
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-12-01 10:35:28 +01:00
Ettore Di Giacinto
53e5b2d6be feat: agent jobs panel (#7390)
* feat(agent): agent jobs

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Multiple webhooks, simplify

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Do not use cron with seconds

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Create separate pages for details

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Detect if no models have MCP configuration, show wizard

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Make services test to run

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-11-28 23:05:39 +01:00
Ettore Di Giacinto
2dd42292dc feat(ui): runtime settings (#7320)
* feat(ui): add watchdog settings

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Do not re-read env

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Some refactor, move other settings to runtime (p2p)

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Add API Keys handling

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Allow to disable runtime settings

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Documentation

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Small fixups

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* show MCP toggle in index

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* Drop context default

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2025-11-20 22:37:20 +01:00