* feat(distributed): add per-model ModelLoadInfo persistence
Adds a dedicated ModelLoadInfo table keyed by model name, decoupled from
the per-replica NodeModel rows. The reconciler can now recover model load
metadata after every NodeModel row has been removed (worker death,
eviction, MarkOffline reaping, frontend restart with stale heartbeats),
which is the read side of Bug-1 from the distributed mode bug hunt.
Registry exposes:
- UpsertModelLoadInfo: ON CONFLICT (model_name) update; last-write-wins,
matching the existing per-replica blob semantics under concurrent
multi-frontend dispatch.
- GetModelLoadInfo: read from the new table first; fall back to the
legacy NodeModel-blob scan for rows written before any frontend in
the cluster ran an UpsertModelLoadInfo (rolling-upgrade transition).
SetNodeModelLoadInfo (per-replica blob) is preserved for backward
compatibility and per-replica diagnostics; the dispatch-path hook in the
next commit calls both.
The new table joins the existing nodes AutoMigrate set under the same
schema-migration advisory lock.
Refs: Bug-1, docs/superpowers/specs/2026-05-24-distributed-mode-bug-hunt-findings.md
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m]
* fix(distributed): persist per-model load info on dispatch
scheduleAndLoad now writes the (backendType, ModelOptions blob) pair to
the new ModelLoadInfo table in addition to the existing per-replica
NodeModel.model_opts_blob field. The per-replica blob still works for
the hot path; the per-model row outlives every NodeModel row going away,
which is what unblocks the reconciler on the read side.
Both writes are best-effort with warn-level logging on failure: a write
miss here just means the reconciler may need a fresh inference request
to repopulate, which is the pre-fix behavior.
Concurrency: two frontends loading the same model at the same time both
fire UpsertModelLoadInfo; ON CONFLICT (model_name) makes the row
converge to whichever commits last. Matches the existing per-replica
blob semantics.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m]
* test(distributed): cover load info persistence and Bug-1 recovery
Adds Ginkgo specs that prove the persistence layer behaves correctly and
that the reconciler actually recovers from the frontend-restart scenario
that was failing in production:
registry_test.go:
- per-model row survives RemoveAllNodeModelReplicas (the bug repro)
- ON CONFLICT (model_name) updates backend type + blob, last-write-wins
- legacy NodeModel-blob fallback still works (rolling-upgrade transition)
- GetModelLoadInfo returns ErrRecordNotFound when both sources are empty
- UpsertModelLoadInfo rejects empty model names
reconciler_test.go:
- Bug-1 end-to-end: with min_replicas=2, no NodeModel rows, but a
ModelLoadInfo row present, one reconcile tick fires two scheduler
calls. Pre-fix this returned "no load info" and the scheduler never
got called until a fresh inference request arrived.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m]
* docs(distributed): note restart-safe reconciler behavior
Adds a bullet to the Replica Reconciler section explaining that per-model
load metadata is persisted across frontend restarts via the new
model_load_infos PostgreSQL table, so a rolling upgrade no longer needs a
fresh inference request per model before the reconciler can replace dead
replicas.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-7[1m]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
LocalAI is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.
- Drop-in API compatibility — OpenAI, Anthropic, ElevenLabs APIs
- 36+ backends — llama.cpp, vLLM, transformers, whisper, diffusers, MLX...
- Any hardware — NVIDIA, AMD, Intel, Apple Silicon, Vulkan, or CPU-only
- Multi-user ready — API key auth, user quotas, role-based access
- Built-in AI agents — autonomous agents with tool use, RAG, MCP, and skills
- Privacy-first — your data never leaves your infrastructure
Created by Ettore Di Giacinto and maintained by the LocalAI team.
📖 Documentation | 💬 Discord | 💻 Quickstart | 🖼️ Models | ❓FAQ
Guided tour
https://github.com/user-attachments/assets/08cbb692-57da-48f7-963d-2e7b43883c18
Click to see more!
User and auth
https://github.com/user-attachments/assets/228fa9ad-81a3-4d43-bfb9-31557e14a36c
Agents
https://github.com/user-attachments/assets/6270b331-e21d-4087-a540-6290006b381a
Usage metrics per user
https://github.com/user-attachments/assets/cbb03379-23b4-4e3d-bd26-d152f057007f
Fine-tuning and Quantization
https://github.com/user-attachments/assets/5ba4ace9-d3df-4795-b7d4-b0b404ea71ee
WebRTC
https://github.com/user-attachments/assets/ed88e34c-fed3-4b83-8a67-4716a9feeb7b
Quickstart
macOS
Note: The DMG is not signed by Apple. After installing, run:
sudo xattr -d com.apple.quarantine /Applications/LocalAI.app. See #6268 for details.
Containers (Docker, podman, ...)
Already ran LocalAI before? Use
docker start -i local-aito restart an existing container.
CPU only:
docker run -ti --name local-ai -p 8080:8080 localai/localai:latest
NVIDIA GPU:
# CUDA 13
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-13
# CUDA 12
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-gpu-nvidia-cuda-12
# NVIDIA Jetson ARM64 (CUDA 12, for AGX Orin and similar)
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-nvidia-l4t-arm64
# NVIDIA Jetson ARM64 (CUDA 13, for DGX Spark)
docker run -ti --name local-ai -p 8080:8080 --gpus all localai/localai:latest-nvidia-l4t-arm64-cuda-13
AMD GPU (ROCm):
docker run -ti --name local-ai -p 8080:8080 --device=/dev/kfd --device=/dev/dri --group-add=video localai/localai:latest-gpu-hipblas
Intel GPU (oneAPI):
docker run -ti --name local-ai -p 8080:8080 --device=/dev/dri/card1 --device=/dev/dri/renderD128 localai/localai:latest-gpu-intel
Vulkan GPU:
docker run -ti --name local-ai -p 8080:8080 localai/localai:latest-gpu-vulkan
Loading models
# From the model gallery (see available models with `local-ai models list` or at https://models.localai.io)
local-ai run llama-3.2-1b-instruct:q4_k_m
# From Huggingface
local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
# From the Ollama OCI registry
local-ai run ollama://gemma:2b
# From a YAML config
local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
# From a standard OCI registry (e.g., Docker Hub)
local-ai run oci://localai/phi-2:latest
Automatic Backend Detection: LocalAI automatically detects your GPU capabilities and downloads the appropriate backend. For advanced options, see GPU Acceleration.
For more details, see the Getting Started guide.
Latest News
- April 2026: Voice recognition, Face recognition, identification & liveness detection, Ollama API compatibility, Video generation in stable-diffusion.ggml, Backend versioning with auto-upgrade, Pin models & load-on-demand toggle, Universal model importer, new backends: sglang, ik-llama-cpp, TurboQuant, sam.cpp, Kokoros, qwen3tts.cpp, tinygrad multimodal
- March 2026: Agent management, New React UI, WebRTC, MLX-distributed via P2P and RDMA, MCP Apps, MCP Client-side
- February 2026: Realtime API for audio-to-audio with tool calling, ACE-Step 1.5 support
- January 2026: LocalAI 3.10.0 — Anthropic API support, Open Responses API, video & image generation (LTX-2), unified GPU backends, tool streaming, Moonshine, Pocket-TTS. Release notes
- December 2025: Dynamic Memory Resource reclaimer, Automatic multi-GPU model fitting (llama.cpp), Vibevoice backend
- November 2025: Import models via URL, Multiple chats and history
- October 2025: Model Context Protocol (MCP) support for agentic capabilities
- September 2025: New Launcher for macOS and Linux, extended backend support for Mac and Nvidia L4T, MLX-Audio, WAN 2.2
- August 2025: MLX, MLX-VLM, Diffusers, llama.cpp now supported on Apple Silicon
- July 2025: All backends migrated outside the main binary — lightweight, modular architecture
For older news and full release notes, see GitHub Releases and the News page.
Features
- Text generation (
llama.cpp,transformers,vllm... and more) - Text to Audio
- Audio to Text
- Image generation
- OpenAI-compatible tools API
- Realtime API (Speech-to-speech)
- Embeddings generation
- Constrained grammars
- Download models from Huggingface
- Vision API
- Object Detection
- Reranker API
- P2P Inferencing
- Distributed Mode — Horizontal scaling with PostgreSQL + NATS
- Model Context Protocol (MCP)
- Built-in Agents — Autonomous AI agents with tool use, RAG, skills, SSE streaming, and Agent Hub
- Backend Gallery — Install/remove backends on the fly via OCI images
- Voice Activity Detection (Silero-VAD)
- Integrated WebUI
Supported Backends & Acceleration
LocalAI supports 36+ backends including llama.cpp, vLLM, transformers, whisper.cpp, diffusers, MLX, MLX-VLM, and many more. Hardware acceleration is available for NVIDIA (CUDA 12/13), AMD (ROCm), Intel (oneAPI/SYCL), Apple Silicon (Metal), Vulkan, and NVIDIA Jetson (L4T). All backends can be installed on-the-fly from the Backend Gallery.
See the full Backend & Model Compatibility Table and GPU Acceleration guide.
Resources
- Documentation
- LLM fine-tuning guide
- Build from source
- Kubernetes installation
- Integrations & community projects
- Installation video walkthrough
- Media & blog posts
- Examples
Team
LocalAI is maintained by a small team of humans, together with the wider community of contributors.
- Ettore Di Giacinto — original author and project lead
- Richard Palethorpe — maintainer
A huge thank you to everyone who contributes code, reviews PRs, files issues, and helps users in Discord — LocalAI is a community-driven project and wouldn't exist without you. See the full contributors list.
Citation
If you utilize this repository, data in a downstream project, please consider citing it with:
@misc{localai,
author = {Ettore Di Giacinto},
title = {LocalAI: The free, Open source OpenAI alternative},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/go-skynet/LocalAI}},
Sponsors
Do you find LocalAI useful?
Support the project by becoming a backer or sponsor. Your logo will show up here with a link to your website.
A huge thank you to our generous sponsors who support this project covering CI expenses, and our Sponsor list:
Individual sponsors
A special thanks to individual sponsors, a full list is on GitHub and buymeacoffee. Special shout out to drikster80 for being generous. Thank you everyone!
Star history
License
LocalAI is a community-driven project created by Ettore Di Giacinto and maintained by the LocalAI team.
MIT - Author Ettore Di Giacinto mudler@localai.io
Acknowledgements
LocalAI couldn't have been built without the help of great software already available from the community. Thank you!
- llama.cpp
- https://github.com/tatsu-lab/stanford_alpaca
- https://github.com/cornelk/llama-go for the initial ideas
- https://github.com/antimatter15/alpaca.cpp
- https://github.com/EdVince/Stable-Diffusion-NCNN
- https://github.com/ggerganov/whisper.cpp
- https://github.com/rhasspy/piper
- exo for the MLX distributed auto-parallel sharding implementation
Contributors
This is a community project, a special thanks to our contributors!
