LocalAI

mirror of https://github.com/mudler/LocalAI.git synced 2026-08-02 19:40:11 -04:00

Go to file

Ettore Di Giacinto 6b63b47f61 feat(distributed): support multiple replicas of one model on the same node (#9583 )

* feat(distributed): support multiple replicas of one model on the same node

The distributed scheduler implicitly assumed `(node_id, model_name)` was
unique, but the schema didn't enforce it and the worker keyed all gRPC
processes by model name alone. With `MinReplicas=2` against a single
worker, the reconciler "scaled up" every 30s but the registry never
advanced past 1 row — the worker re-loaded the model in-place every tick
until VRAM fragmented and the gRPC process died.

This change introduces multi-replica-per-node as a first-class concept,
with capacity-aware scheduling, a circuit breaker, and VRAM
soft-reservation. Operators can declare per-node capacity via the worker
flag `--max-replicas-per-model` (mirrored as auto-label
`node.replica-slots=N`) or override per-node from the UI.

* Schema: BackendNode gains MaxReplicasPerModel (default 1) and
ReservedVRAM. NodeModel gains ReplicaIndex (composite with node_id +
model_name). ModelSchedulingConfig gains UnsatisfiableUntil/Ticks for
the reconciler circuit breaker.

* Registry: replica_index threaded through SetNodeModel, RemoveNodeModel,
IncrementInFlight, DecrementInFlight, TouchNodeModel, GetNodeModel,
SetNodeModelLoadInfo and the InFlightTrackingClient. New helpers:
CountReplicasOnNode, NextFreeReplicaIndex (with ErrNoFreeSlot),
RemoveAllNodeModelReplicas, FindNodesWithFreeSlot,
ClusterCapacityForModel, ReserveVRAM/ReleaseVRAM (atomic UPDATE with
ErrInsufficientVRAM), and the unsatisfiable-flag CRUD.

* Worker: processKey now `<modelID>#<replicaIndex>` so concurrent loads
of the same model land on distinct ports. Adds CLI flag
--max-replicas-per-model (env LOCALAI_MAX_REPLICAS_PER_MODEL, default 1)
and emits the auto-label.

* Router: scheduleNewModel filters candidates by free slot, allocates the
replica index, and soft-reserves VRAM before installing the backend.
evictLRUAndFreeNode now deletes the targeted row by ID instead of all
replicas of the model on the node — fixes a latent bug where evicting
one replica orphaned its siblings.

* Reconciler: caps scale-up at ClusterCapacityForModel so a misconfig
(MinReplicas > capacity) doesn't loop forever. After 3 consecutive
ticks of capacity==0 it sets UnsatisfiableUntil for a 5m cooldown and
emits a warning. ClearAllUnsatisfiable fires from Register,
ApproveNode, SetNodeLabel(s), RemoveNodeLabel and
UpdateMaxReplicasPerModel so a new node joining or label changes wake
the reconciler immediately. scaleDownIdle removes highest-replica-index
first to keep slots compact.

* Heartbeat resets reserved_vram to 0 — worker is the source of truth
for actual free VRAM; the reservation is only for the in-tick race
window between two scheduling decisions.

* Probe path (reconciler.probeLoadedModels and health.doCheckAll) now
pass the row's replica_index to RemoveNodeModel so an unreachable
replica doesn't orphan healthy siblings.

* Admin override: PUT /api/nodes/:id/max-replicas-per-model sets a
sticky override (preserved across worker re-registration). DELETE
clears the override so the worker's flag applies again on next
register. Required because Kong defaults the worker flag to 1, so
every worker restart would have silently reverted the UI value.

* React UI: always-visible slot badge on the node row (muted at default
1, accented when >1); inline editor in the expanded drawer with
pencil-to-edit, Save/Cancel, Esc/Enter, "(override)" indicator when
the value is admin-set, and a "Reset" button to hand control back to
the worker. Soft confirm when shrinking the cap below the count of
loaded replicas. Scheduling rules table gets an "Unsatisfiable until
HH:MM" status badge surfacing the cooldown.

* node.replica-slots filtered out of the labels strip on the row to
avoid duplicating the slot badge.

23 new Ginkgo specs (registry, reconciler, inflight, health) cover:
multi-replica row independence, RemoveNodeModel of one replica
preserving siblings, NextFreeReplicaIndex slot allocation including
ErrNoFreeSlot, capacity-gated scale-up with circuit breaker tripping
and recovery on Register, scheduleDownIdle ordering, ClusterCapacity
math, ReserveVRAM admission gating, Heartbeat reset, override survival
across worker re-registration, and ResetMaxReplicasPerModel handing
control back. Plus 8 stdlib tests for the worker processKey / CLI /
auto-label.

Closes the flap reproduced on Qwen3.6-35B against the nvidia-thor
worker (single 128 GiB node, MinReplicas=2): the reconciler now caps
the scale-up at the cluster's actual capacity instead of looping.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Read] [Edit] [Bash] [Skill:critique] [Skill:audit] [Skill:polish] [Skill:golang-testing]

* refactor(react-ui/nodes): tighten capacity editor copy + adopt ActionMenu for row actions

* Capacity editor hint trimmed from operator-doc-style ("Sourced from
the worker's `--max-replicas-per-model` flag. Changing it here makes it
a sticky admin override that survives worker restarts." → "Saved
values stick across worker restarts.") and the override-state copy
similarly compressed. The full mechanic is no longer needed in the UI
— the override pill carries the meaning and the docs cover the rest.

* Node row actions migrated from an inline cluster of icon buttons
(Drain / Resume / Trash) to the kebab ActionMenu used by /manage for
per-row model actions, so dense Nodes tables stay clean. Approve
stays as a prominent primary button — it's a stateful admission gate,
not a routine action, and elevating it matches how /manage surfaces
install-time decisions outside the menu.

* The expanded drawer's Labels section now filters node.replica-slots
out of the editable label list. The label is owned by the Capacity
editor above; surfacing it again as an editable label invited
confusion (the Capacity save would clobber any direct edit).

Both backend and agent workers benefit — they share the row rendering
path, so the action menu and label filter apply to both.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp] [Skill:critique] [Skill:audit] [Skill:polish]

* fix(react-ui/nodes): suppress slot badge on agent workers

Agent workers don't load models, so the per-node replica capacity is
inapplicable to them. Showing "1× slots" on agent rows was a tiny
inconsistency from the unified rendering path — gate the badge on
node_type !== 'agent' so it only appears on backend workers.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp]

* refactor(react-ui/nodes): distill expanded drawer + restyle scheduling form

The expanded node drawer used to stack five panels — slot badge,
filled capacity box, Loaded Models h4+empty-state, Installed Backends
h4+empty-state, Labels h4+chips+form — making routine inspections feel
like a control panel. The scheduling rule form wrapped its mode toggle
as two 50%-width filled buttons that competed visually with the actual
primary action.

* Drawer: collapse three rarely-touched config zones (Capacity,
Backends, Labels) into one `<details>` "Manage" disclosure (closed by
default) with small uppercase eyebrow labels for each zone instead of
parallel h4 sub-headings. Loaded Models stays as the at-a-glance
headline with a single-line empty hint instead of a boxed empty state.
CapacityEditor renders flat (no filled background) — the Manage
disclosure provides framing.

* Scheduling form: replace the chunky 50%-width button-tabs with the
project's existing `.segmented` control (icon + label, sized to
content). Mode hint becomes a single tied line below. Fields stack
vertically with helper text under inputs and a hairline divider above
the right-aligned Save / Cancel.

The empty drawer collapses from ~5 stacked sections (~280px tall) to
two lines (~80px). The scheduling form now reads as a designed dialog
instead of raw building blocks. Both surfaces now match the typographic
density and weight of the rest of the admin pages.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [chrome-devtools-mcp] [Skill:distill] [Skill:audit] [Skill:polish]

* feat(react-ui/nodes): replace scheduling form's model picker with searchable combobox

The native <select> made operators scroll through every gallery entry to
find a model name. The project already has SearchableModelSelect (used
in Studio/Talk/etc.) which combines free-text search with the gallery
list and accepts typed model names that aren't installed yet — useful
for pre-staging a scheduling rule before the node it'll run on has
finished bootstrapping.

Also drops the now-unused useModels import (the combobox manages the
gallery hook internally).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit]

* refactor(react-ui/nodes): consolidate key/value chip editor + add replica preset chips

The Nodes page was rendering the same key=value chip pattern in two
places with subtly different markup: the Labels editor in the expanded
drawer and (post-distill) the Node Selector input in the scheduling
form. The form's input was also a comma-separated string that operators
were getting wrong.

* Extract <KeyValueChips> as a fully controlled chip-builder. Parent
owns the map and decides what onAdd/onRemove does — form state for the
scheduling form, API calls for the live drawer Labels editor. Same
visuals everywhere; one component to change when polish needs apply.

* Replace the comma-separated Node Selector text input with KeyValueChips.
Operators were copying syntax from docs and missing commas; the chip
vocabulary makes the key=value structure self-documenting.

* Add <ReplicaInput>: numeric input + quick-pick preset chips for Min/Max
replicas. Picked over a slider because replica counts are exact specs
derived from VRAM math (operator decision, not a fuzzy estimate). The
chips give one-click access to common values (1/2/3/4 for Min,
0=no-limit/2/4/8 for Max) without the slider's special-value problem
(MaxReplicas=0 is categorical, not a position on a continuum).

* Drop the now-unused labelInputs state in the Nodes page (the inline
label editor's per-node draft state lived there and is now owned by
KeyValueChips).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [Skill:distill]

* test: fix CI fallout from multi-replica refactor (e2e/distributed + playwright)

Two breakages caught by CI that didn't surface in the local run:

* tests/e2e/distributed/*.go — multiple files used the pre-PR2 registry
signatures for SetNodeModel / IncrementInFlight / DecrementInFlight /
RemoveNodeModel / TouchNodeModel / GetNodeModel / SetNodeModelLoadInfo
and one stale adapter.InstallBackend call in node_lifecycle_test.go.
All updated to pass replicaIndex=0 — these tests don't exercise
multi-replica behavior, they just need to compile against the new
signatures. The chip-builder tests in core/services/nodes/ already
cover the multi-replica logic.

* core/http/react-ui/e2e/nodes-per-node-backend-actions.spec.js — the
drawer's distill refactor moved Backends inside a "Manage" <details>
disclosure that's collapsed by default. The test helper expanded the
node row but never opened Manage, so the per-node backend table was
never in the DOM. Helper now clicks `.node-manage > summary` after
expanding the row.

All 100 playwright tests pass locally; tests/e2e/distributed compiles
clean.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:opus-4-7 [Edit] [Bash]

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

2026-04-27 21:20:05 +02:00

.agents

ci(python-backends): add weekly DEPS_REFRESH cache-buster

2026-04-27 14:21:11 +00:00

.devcontainer

fix: Add named volumes for Windows Docker compatibility (#8661 )

2026-02-26 23:18:53 +01:00

.devcontainer-scripts

feat: refactor build process, drop embedded backends (#5875 )

2025-07-22 16:31:04 +02:00

.github

ci(python-backends): add weekly DEPS_REFRESH cache-buster

2026-04-27 14:21:11 +00:00

.vscode

feat: refactor build process, drop embedded backends (#5875 )

2025-07-22 16:31:04 +02:00

backend

ci(python-backends): add weekly DEPS_REFRESH cache-buster

2026-04-27 14:21:11 +00:00

cmd

feat: Merge repeated log lines in the terminal (#9141 )

2026-03-26 22:16:13 +01:00

configuration

refactor: move remaining api packages to core (#1731 )

2024-03-01 16:19:53 +01:00

core

feat(distributed): support multiple replicas of one model on the same node (#9583 )

2026-04-27 21:20:05 +02:00

custom-ca-certs

feat(certificates): add support for custom CA certificates (#880 )

2023-11-01 20:10:14 +01:00

docs

fix(docs): replace Docsy alert shortcode with Relearn notice

2026-04-25 21:04:31 +00:00

examples

docs: make examples repository link more prominent (#8895 )

2026-03-09 09:26:16 +01:00

gallery

fix(gallery): correct Qwen3.5 typo in qwen3.5-27b-claude-4.6 model override (closes #9362 ) (#9580 )

2026-04-27 09:24:00 +02:00

internal

feat: cleanups, small enhancements

2023-07-04 18:58:19 +02:00

pkg

feat: Log backend exit code (#9581 )

2026-04-27 14:19:18 +02:00

prompt-templates

Requested Changes from GPT4ALL to Luna-AI-Llama2 (#1092 )

2023-09-22 11:22:17 +02:00

scripts

fix: add hipblaslt library (#9541 )

2026-04-24 18:50:03 +02:00

swagger

feat(swagger): update swagger (#9518 )

2026-04-23 23:26:50 +02:00

tests

feat(distributed): support multiple replicas of one model on the same node (#9583 )

2026-04-27 21:20:05 +02:00

.air.toml

feat(ui): chat stats, small visual enhancements (#7223 )

2025-11-10 18:12:07 +01:00

.dockerignore

feat(whisper-cpp): Convert to Purego and add VAD (#6087 )

2025-08-28 17:25:18 +02:00

.editorconfig

feat(stores): Vector store backend (#1795 )

2024-03-22 21:14:04 +01:00

.env

feat(diffusers): add experimental support for sd_embed-style prompt embedding (#8504 )

2026-02-11 22:58:19 +01:00

.gitattributes

chore(linguist): add *.hpp files to linguist-vendored (#4154 )

2024-11-14 14:12:16 +01:00

.gitignore

fix(ui): Add tracing inline settings back and create UI tests (#9027 )

2026-03-16 17:51:06 +01:00

.gitmodules

feat: Add Kokoros backend (#9212 )

2026-04-08 19:23:16 +02:00

.goreleaser.yaml

feat(ui): move to React for frontend (#8772 )

2026-03-05 21:47:12 +01:00

.yamllint

fix: yamlint warnings and errors (#2131 )

2024-04-25 17:25:56 +00:00

AGENTS.md

ci(python-backends): add weekly DEPS_REFRESH cache-buster

2026-04-27 14:21:11 +00:00

CLAUDE.md

fix(realtime): Add functions to conversation history (#8616 )

2026-02-21 19:03:49 +01:00

CONTRIBUTING.md

docs(agents): adopt kernel's AI coding assistants policy

2026-04-19 22:50:54 +00:00

docker-compose.distributed.yaml

fix(distributed): worker container healthcheck always unhealthy

2026-04-27 13:51:57 +00:00

docker-compose.yaml

fix(distributed): correct VRAM/RAM reporting on NVIDIA unified-memory hosts (#9545 )

2026-04-24 22:02:23 +02:00

Dockerfile

ci: switch image/backend build cache to a dedicated registry image

2026-04-27 13:13:04 +00:00

Entitlements.plist

Feat: OSX Local Codesigning (#1319 )

2023-11-23 15:22:54 +01:00

entrypoint.sh

feat: ⚠️ reduce images size and stop bundling sources (#5721 )

2025-06-26 18:41:38 +02:00

go.mod

feat: Log backend exit code (#9581 )

2026-04-27 14:19:18 +02:00

go.sum

feat: Log backend exit code (#9581 )

2026-04-27 14:19:18 +02:00

LICENSE

chore(docs): update license year

2025-02-15 18:17:15 +01:00

Makefile

[intel GPU support] Use latest oneapi-basekit image for Intel images to support b70 (#9543 )

2026-04-24 18:29:10 +02:00

README.md

docs(readme): add April 2026 highlights to Latest News

2026-04-23 20:47:06 +00:00

renovate.json

ci: manually update deps

2023-05-04 15:01:29 +02:00

SECURITY.md

docs: clarify SECURITY.md version support table with specific ranges and EOL dates (#8861 )

2026-03-08 17:58:19 +01:00

webui_static.yaml

feat(ui): move to React for frontend (#8772 )

2026-03-05 21:47:12 +01:00

README.md

LocalAI is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.

Drop-in API compatibility — OpenAI, Anthropic, ElevenLabs APIs
36+ backends — llama.cpp, vLLM, transformers, whisper, diffusers, MLX...
Any hardware — NVIDIA, AMD, Intel, Apple Silicon, Vulkan, or CPU-only
Multi-user ready — API key auth, user quotas, role-based access
Built-in AI agents — autonomous agents with tool use, RAG, MCP, and skills
Privacy-first — your data never leaves your infrastructure

Created and maintained by Ettore Di Giacinto.

📖 Documentation | 💬 Discord | 💻 Quickstart | 🖼️ Models | ❓FAQ

Guided tour

https://github.com/user-attachments/assets/08cbb692-57da-48f7-963d-2e7b43883c18

Click to see more!

Quickstart

macOS

Note: The DMG is not signed by Apple. After installing, run: sudo xattr -d com.apple.quarantine /Applications/LocalAI.app. See #6268 for details.

Containers (Docker, podman, ...)

Already ran LocalAI before? Use docker start -i local-ai to restart an existing container.

CPU only:

docker run -ti --name local-ai -p 8080:8080 localai/localai:latest

NVIDIA GPU:

# CUDA 13
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-13

# CUDA 12
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-12

# NVIDIA Jetson ARM64 (CUDA 12, for AGX Orin and similar)
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-nvidia-l4t-arm64

# NVIDIA Jetson ARM64 (CUDA 13, for DGX Spark)
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-nvidia-l4t-arm64-cuda-13

AMD GPU (ROCm):

docker run -ti --name local-ai -p 8080:8080 --device=/dev/kfd --device=/dev/dri --group-add=video localai/localai:latest-gpu-hipblas

Intel GPU (oneAPI):

docker run -ti --name local-ai -p 8080:8080 --device=/dev/dri/card1 --device=/dev/dri/renderD128 localai/localai:latest-gpu-intel

Vulkan GPU:

docker run -ti --name local-ai -p 8080:8080 localai/localai:latest-gpu-vulkan

Loading models

# From the model gallery (see available models with `local-ai models list` or at https://models.localai.io)
local-ai run llama-3.2-1b-instruct:q4_k_m
# From Huggingface
local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
# From the Ollama OCI registry
local-ai run ollama://gemma:2b
# From a YAML config
local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
# From a standard OCI registry (e.g., Docker Hub)
local-ai run oci://localai/phi-2:latest

Automatic Backend Detection: LocalAI automatically detects your GPU capabilities and downloads the appropriate backend. For advanced options, see GPU Acceleration.

For more details, see the Getting Started guide.

Latest News

April 2026: Voice recognition, Face recognition, identification & liveness detection, Ollama API compatibility, Video generation in stable-diffusion.ggml, Backend versioning with auto-upgrade, Pin models & load-on-demand toggle, Universal model importer, new backends: sglang, ik-llama-cpp, TurboQuant, sam.cpp, Kokoros, qwen3tts.cpp, tinygrad multimodal
March 2026: Agent management, New React UI, WebRTC, MLX-distributed via P2P and RDMA, MCP Apps, MCP Client-side
February 2026: Realtime API for audio-to-audio with tool calling, ACE-Step 1.5 support
January 2026: LocalAI 3.10.0 — Anthropic API support, Open Responses API, video & image generation (LTX-2), unified GPU backends, tool streaming, Moonshine, Pocket-TTS. Release notes
December 2025: Dynamic Memory Resource reclaimer, Automatic multi-GPU model fitting (llama.cpp), Vibevoice backend
November 2025: Import models via URL, Multiple chats and history
October 2025: Model Context Protocol (MCP) support for agentic capabilities
September 2025: New Launcher for macOS and Linux, extended backend support for Mac and Nvidia L4T, MLX-Audio, WAN 2.2
August 2025: MLX, MLX-VLM, Diffusers, llama.cpp now supported on Apple Silicon
July 2025: All backends migrated outside the main binary — lightweight, modular architecture

For older news and full release notes, see GitHub Releases and the News page.

Features

Text generation (llama.cpp, transformers, vllm ... and more)
Text to Audio
Audio to Text
Image generation
OpenAI-compatible tools API
Realtime API (Speech-to-speech)
Embeddings generation
Constrained grammars
Download models from Huggingface
Vision API
Object Detection
Reranker API
P2P Inferencing
Distributed Mode — Horizontal scaling with PostgreSQL + NATS
Model Context Protocol (MCP)
Built-in Agents — Autonomous AI agents with tool use, RAG, skills, SSE streaming, and Agent Hub
Backend Gallery — Install/remove backends on the fly via OCI images
Voice Activity Detection (Silero-VAD)
Integrated WebUI

Supported Backends & Acceleration

LocalAI supports 36+ backends including llama.cpp, vLLM, transformers, whisper.cpp, diffusers, MLX, MLX-VLM, and many more. Hardware acceleration is available for NVIDIA (CUDA 12/13), AMD (ROCm), Intel (oneAPI/SYCL), Apple Silicon (Metal), Vulkan, and NVIDIA Jetson (L4T). All backends can be installed on-the-fly from the Backend Gallery.

See the full Backend & Model Compatibility Table and GPU Acceleration guide.

Resources

Autonomous Development Team

LocalAI is helped being maintained by a team of autonomous AI agents led by an AI Scrum Master.

Live Reports: reports.localai.io
Project Board: Agent task tracking
Blog Post: Learn about the experiment

Citation

If you utilize this repository, data in a downstream project, please consider citing it with:

@misc{localai,
  author = {Ettore Di Giacinto},
  title = {LocalAI: The free, Open source OpenAI alternative},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/go-skynet/LocalAI}},

Star history

License

LocalAI is a community-driven project created by Ettore Di Giacinto.

MIT - Author Ettore Di Giacinto mudler@localai.io

Acknowledgements

LocalAI couldn't have been built without the help of great software already available from the community. Thank you!

llama.cpp
https://github.com/tatsu-lab/stanford_alpaca
https://github.com/cornelk/llama-go for the initial ideas
https://github.com/antimatter15/alpaca.cpp
https://github.com/EdVince/Stable-Diffusion-NCNN
https://github.com/ggerganov/whisper.cpp
https://github.com/rhasspy/piper
exo for the MLX distributed auto-parallel sharding implementation

Contributors

This is a community project, a special thanks to our contributors!

Languages

Go 67.7%

JavaScript 11.4%

Python 5.3%

C++ 5.2%

HTML 4.1%

Other 6.3%

README.md

Guided tour

User and auth

Agents

Usage metrics per user

Fine-tuning and Quantization

WebRTC

Quickstart

macOS

Containers (Docker, podman, ...)

CPU only:

NVIDIA GPU:

AMD GPU (ROCm):

Intel GPU (oneAPI):

Vulkan GPU:

Loading models

Latest News

Features

Supported Backends & Acceleration

Resources

Autonomous Development Team

Citation

Sponsors

Individual sponsors

Star history

License

Acknowledgements

Contributors