+++
disableToc = false
title = "Setting Up Models"
weight = 2
icon = "hub"
description = "Learn how to install, configure, and manage models in LocalAI"
+++
This section covers everything you need to know about installing and configuring models in LocalAI. You'll learn multiple methods to get models running.
Prerequisites
- LocalAI installed and running (see [Quickstart]({{% relref "getting-started/quickstart" %}}) if you haven't set it up yet)
- Basic understanding of command line usage
Method 1: Using the Model Gallery (Easiest)
The Model Gallery is the simplest way to install models. It provides pre-configured models ready to use.
Via WebUI
- Open the LocalAI WebUI at http://localhost:8080
- Navigate to the "Models" tab
- Browse available models
- Click "Install" on any model you want
- Wait for installation to complete
For more details, refer to the [Gallery Documentation]({{% relref "features/model-gallery" %}}).
Via CLI
# List available models
local-ai models list
# Install a specific model
local-ai models install llama-3.2-1b-instruct:q4_k_m
# Start LocalAI with a model from the gallery
local-ai run llama-3.2-1b-instruct:q4_k_m
To run models available in the LocalAI gallery, you can use the model name as the URI. For example, to run LocalAI with the Hermes model, execute:
local-ai run hermes-2-theta-llama-3-8b
To install only the model, use:
local-ai models install hermes-2-theta-llama-3-8b
Note: The galleries available in LocalAI can be customized to point to a different URL or a local directory. For more information on how to set up your own gallery, see the [Gallery Documentation]({{% relref "features/model-gallery" %}}).
Browse Online
Visit models.localai.io to browse all available models in your browser.
Method 1.5: Import Models via WebUI
The WebUI provides a powerful model import interface that supports both simple and advanced configuration:
Simple Import Mode
- Open the LocalAI WebUI at http://localhost:8080
- Click "Import Model"
- Enter the model URI (e.g., https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct-GGUF)
- Optionally configure preferences:
  - Backend selection
  - Model name
  - Description
  - Quantizations
  - Embeddings support
  - Custom preferences
- Click "Import Model" to start the import process
Advanced Import Mode
For full control over model configuration:
- In the WebUI, click "Import Model"
- Toggle to "Advanced Mode"
- Edit the YAML configuration directly in the code editor
- Use the "Validate" button to check your configuration
- Click "Create" or "Update" to save
The advanced editor includes:
- Syntax highlighting
- YAML validation
- Format and copy tools
- Full configuration options
This is especially useful for:
- Custom model configurations
- Fine-tuning model parameters
- Configuring complex model setups
- Editing existing model configurations
Method 2: Installing from Hugging Face
LocalAI can directly install models from Hugging Face:
# Install and run a model from Hugging Face
local-ai run huggingface://TheBloke/phi-2-GGUF
The format is: huggingface://<repository>/<model-file> (<model-file> is optional)
Examples
local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
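The URI format described above can be assembled with a small helper. This is a minimal sketch: `hf_uri` is a hypothetical function name, not part of the LocalAI CLI, and it only concatenates strings without checking that the repository or file exists.

```shell
# hf_uri REPO [FILE]: print a huggingface:// URI.
# REPO is "<owner>/<name>"; FILE is the optional model file inside it.
hf_uri() {
  if [ -z "${1:-}" ]; then
    echo "usage: hf_uri <owner/repo> [model-file]" >&2
    return 1
  fi
  if [ -n "${2:-}" ]; then
    printf 'huggingface://%s/%s\n' "$1" "$2"
  else
    printf 'huggingface://%s\n' "$1"
  fi
}

# Example (left as a comment so sourcing this file has no side effects):
# local-ai run "$(hf_uri TheBloke/phi-2-GGUF phi-2.Q8_0.gguf)"
```

Keeping the URI construction in one place makes it easier to switch quantization files in scripts without retyping the repository name.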
Method 3: Installing from OCI Registries
Ollama Registry
local-ai run ollama://gemma:2b
Standard OCI Registry
local-ai run oci://localai/phi-2:latest
Run Models via URI
To run models via URI, specify a URI to a model file or a configuration file when starting LocalAI. Valid syntax includes:
- file://path/to/model (absolute path to a file within your models directory)
- huggingface://repository_id/model_file (e.g., huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf)
- From OCIs: oci://container_image:tag, ollama://model_id:tag
- From configuration files: https://gist.githubusercontent.com/.../phi-2.yaml
{{% notice note %}}
When using file:// URLs, the path must point to a file within your models directory (specified by MODELS_PATH). Files outside this directory are rejected for security reasons.
{{% /notice %}}
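Before passing a file:// URI, it can help to check locally that the file really sits inside the models directory. The sketch below is a pre-flight check only, under the assumption that comparing resolved directory paths is a reasonable approximation; it is not LocalAI's own validation logic, and `in_models_dir` is a hypothetical name.

```shell
# in_models_dir FILE [MODELS_DIR]: succeed only if FILE exists and its
# directory resolves to a location inside MODELS_DIR
# (default: $MODELS_PATH, else ./models).
# Runs in a subshell so the cd calls do not affect the caller.
in_models_dir() (
  file=${1:-}
  dir=${2:-${MODELS_PATH:-./models}}
  [ -f "$file" ] && [ -d "$dir" ] || exit 1
  # Resolve symlinked directories and ".." segments before comparing.
  root=$(cd "$dir" && pwd -P) || exit 1
  parent=$(cd "$(dirname "$file")" && pwd -P) || exit 1
  case "$parent/" in
    "$root"/*) exit 0 ;;
    *) exit 1 ;;
  esac
)

# Example:
# in_models_dir "$PWD/models/phi-2.Q4_K_M.gguf" "$PWD/models" && echo "ok"
```

Note that the check resolves the containing directory, not a symlink at the file itself, so a symlinked model file pointing elsewhere would still pass.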
Configuration files can be used to customize the model defaults and settings. For advanced configurations, refer to the [Customize Models section]({{% relref "getting-started/customize-model" %}}).
Examples
local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
local-ai run ollama://gemma:2b
local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
local-ai run oci://localai/phi-2:latest
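The URI schemes used in these examples can be told apart with a simple `case` statement, which is handy when a wrapper script wants to log or restrict where models come from. This is an illustrative sketch; `model_source` and the labels it prints are hypothetical and do not mirror LocalAI's internal resolution.

```shell
# model_source URI: print which kind of source a model URI refers to.
model_source() {
  case "${1:-}" in
    huggingface://*)    echo "huggingface" ;;
    ollama://*)         echo "ollama" ;;
    oci://*)            echo "oci" ;;
    file://*)           echo "file" ;;
    http://*|https://*) echo "url" ;;
    "")                 echo "missing URI" >&2; return 1 ;;
    # Anything without a scheme is treated as a gallery model name.
    *)                  echo "gallery-name" ;;
  esac
}
```

For example, `model_source ollama://gemma:2b` prints `ollama`, while a bare name such as `hermes-2-theta-llama-3-8b` falls through to `gallery-name`.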
Method 4: Manual Installation
For full control, you can manually download and configure models.
Step 1: Download a Model
Download a GGUF model file, for example from Hugging Face:
mkdir -p models
wget https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
-O models/phi-2.Q4_K_M.gguf
Step 2: Create a Configuration File (Optional)
Create a YAML file to configure the model:
# models/phi-2.yaml
name: phi-2
parameters:
  model: phi-2.Q4_K_M.gguf
  temperature: 0.7
context_size: 2048
threads: 4
backend: llama-cpp
Customize model defaults and settings with a configuration file. For advanced configurations, refer to the [Advanced Documentation]({{% relref "advanced" %}}).
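If you create several such files, generating them from a template avoids copy-paste mistakes. The following is a minimal sketch that writes the fields from the phi-2 example, assuming `model` and `temperature` are nested under `parameters`; `write_model_config` is a hypothetical helper, and the default values are the ones shown in this guide, not tuned recommendations.

```shell
# write_model_config DIR NAME MODEL_FILE: create DIR/NAME.yaml and print
# its path. Refuses to overwrite an existing configuration.
write_model_config() (
  dir=${1:?models directory required}
  name=${2:?model name required}
  model_file=${3:?model file name required}
  target="$dir/$name.yaml"
  [ -e "$target" ] && { echo "exists: $target" >&2; exit 1; }
  mkdir -p "$dir"
  cat > "$target" <<EOF
name: $name
parameters:
  model: $model_file
  temperature: 0.7
context_size: 2048
threads: 4
backend: llama-cpp
EOF
  echo "$target"
)

# Example:
# write_model_config models phi-2 phi-2.Q4_K_M.gguf
```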
Step 3: Run LocalAI
Choose one of the following methods to run LocalAI:
{{< tabs >}} {{% tab title="Docker" %}}
mkdir models
cp your-model.gguf models/
docker run -p 8080:8080 -v $PWD/models:/models -ti --rm quay.io/go-skynet/local-ai:latest --models-path /models --context-size 700 --threads 4
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "your-model.gguf",
"prompt": "A long time ago in a galaxy far, far away",
"temperature": 0.7
}'
{{% notice tip %}} Other Docker Images:
For other Docker images, please refer to the table in [the container images section]({{% relref "getting-started/container-images" %}}). {{% /notice %}}
Example:
mkdir models
wget https://huggingface.co/TheBloke/Luna-AI-Llama2-Uncensored-GGUF/resolve/main/luna-ai-llama2-uncensored.Q4_0.gguf -O models/luna-ai-llama2
cp -rf prompt-templates/getting_started.tmpl models/luna-ai-llama2.tmpl
docker run -p 8080:8080 -v $PWD/models:/models -ti --rm quay.io/go-skynet/local-ai:latest --models-path /models --context-size 700 --threads 4
curl http://localhost:8080/v1/models
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "luna-ai-llama2",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 0.9
}'
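For repeated requests, the chat completion call above can be wrapped in a few shell functions. This is a sketch under stated assumptions: `json_escape`, `chat_payload`, `ask`, and the `LOCALAI_URL` variable are hypothetical names, and the escaping handles only backslashes and double quotes, not newlines or control characters.

```shell
# json_escape TEXT: escape backslashes and double quotes for a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# chat_payload MODEL PROMPT: print a chat-completions request body.
chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "temperature": 0.9}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

# ask MODEL PROMPT: send the request to a running LocalAI instance.
ask() {
  curl -s "${LOCALAI_URL:-http://localhost:8080}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$1" "$2")"
}

# Example:
# ask luna-ai-llama2 "How are you?"
```

Building the body separately from the `curl` call makes it possible to inspect the JSON before sending it.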
{{% notice note %}}
- If you are running on Apple Silicon (ARM), running LocalAI in Docker is not recommended because of emulation. Follow the [build instructions]({{% relref "installation/build" %}}) to use Metal acceleration for full GPU support.
- If you are running on an Apple x86_64 machine, you can use Docker; building from source brings no additional benefit. {{% /notice %}}
{{% /tab %}} {{% tab title="Docker Compose" %}}
git clone https://github.com/go-skynet/LocalAI
cd LocalAI
cp your-model.gguf models/
docker compose up -d --pull always
curl http://localhost:8080/v1/models
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "your-model.gguf",
"prompt": "A long time ago in a galaxy far, far away",
"temperature": 0.7
}'
{{% notice tip %}} Other Docker Images:
For other Docker images, please refer to the table in Getting Started. {{% /notice %}}
Note: If you are on Windows, ensure the project is on the Linux filesystem to avoid slow model loading. For more information, see the Microsoft Docs.
{{% /tab %}} {{% tab title="Kubernetes" %}}
For Kubernetes deployment, see the [Kubernetes installation guide]({{% relref "installation/kubernetes" %}}).
{{% /tab %}} {{% tab title="From Binary" %}}
LocalAI binary releases are available on GitHub.
# With binary
local-ai --models-path ./models
{{% notice tip %}} If installing on macOS, you might encounter a message saying:
"local-ai-git-Darwin-arm64" (or the name you gave the binary) can't be opened because Apple cannot check it for malicious software.
Hit OK, then go to Settings > Privacy & Security > Security and look for the message:
"local-ai-git-Darwin-arm64" was blocked from use because it is not from an identified developer.
Press "Allow Anyway." {{% /notice %}}
{{% /tab %}} {{% tab title="From Source" %}}
For instructions on building LocalAI from source, see the [Build from Source guide]({{% relref "installation/build" %}}).
{{% /tab %}} {{< /tabs >}}
GPU Acceleration
For instructions on GPU acceleration, visit the [GPU Acceleration]({{% relref "features/gpu-acceleration" %}}) page.
For more model configurations, visit the Examples Section.
Understanding Model Files
File Formats
- GGUF: Modern format, recommended for most use cases
- GGML: Older format, still supported but deprecated
Quantization Levels
Models come in different quantization levels (quality vs. size trade-off):
| Quantization | Size | Quality | Use Case |
|---|---|---|---|
| Q8_0 | Largest | Highest | Best quality, requires more RAM |
| Q6_K | Large | Very High | High quality |
| Q4_K_M | Medium | High | Balanced (recommended) |
| Q4_K_S | Small | Medium | Lower RAM usage |
| Q2_K | Smallest | Lower | Minimal RAM, lower quality |
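To turn the table into a rough number, multiply the parameter count by the bits stored per weight. The sketch below does that arithmetic with `awk`; `estimate_gib` is a hypothetical helper, and the bits-per-weight value is an input you supply. As an assumption for illustration, Q8_0 stores about 8.5 bits per weight and the Q4_K variants fall roughly between 4.5 and 5; exact figures vary by model.

```shell
# estimate_gib PARAMS_BILLIONS BITS_PER_WEIGHT: rough size of the weights
# in GiB. Ignores file metadata, KV cache, and runtime overhead, so actual
# RAM usage will be higher.
estimate_gib() {
  awk -v p="$1" -v b="$2" 'BEGIN {
    bytes = p * 1e9 * b / 8            # parameters * bits, converted to bytes
    printf "%.1f\n", bytes / (1024 * 1024 * 1024)
  }'
}

# Example: a 7B-parameter model at about 8.5 bits per weight
# estimate_gib 7 8.5
```

Comparing the result for two quantization levels gives a quick sense of how much RAM a smaller quantization saves before you download anything.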
Choosing the Right Model
Consider:
- RAM available: Larger models need more RAM
- Use case: Different models excel at different tasks
- Speed: Smaller quantizations are faster
- Quality: Higher-precision quantizations (such as Q8_0) produce better output
Model Configuration
Basic Configuration
Create a YAML file in your models directory:
name: my-model
parameters:
  model: model.gguf
  temperature: 0.7
  top_p: 0.9
context_size: 2048
threads: 4
backend: llama-cpp
Advanced Configuration
See the [Model Configuration]({{% relref "advanced/model-configuration" %}}) guide for all available options.
Managing Models
List Installed Models
# Via API
curl http://localhost:8080/v1/models
# Via CLI
local-ai models list
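In scripts, it is often enough to know whether one particular model is listed. The sketch below filters the API response for a given id. It assumes the response follows the OpenAI-style list shape (`{"data":[{"id":"..."}]}`) and uses plain text matching; `has_model` is a hypothetical name, and a real JSON parser such as `jq` would be more robust.

```shell
# has_model ID: read a /v1/models JSON response on stdin and succeed if a
# model with exactly that id is listed.
has_model() {
  # Strip whitespace so '"id": "x"' and '"id":"x"' compare equal, then
  # look for the literal key/value pair (-F disables regex matching).
  tr -d ' \n\t' | grep -qF "\"id\":\"$1\""
}

# Example:
# curl -s http://localhost:8080/v1/models | has_model phi-2 && echo "installed"
```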
Remove Models
Simply delete the model file and configuration from your models directory:
rm models/model-name.gguf
rm models/model-name.yaml # if exists
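The two `rm` commands can be combined into a helper that skips files that do not exist. This is a minimal sketch assuming the model file uses a `.gguf` extension and the configuration shares its base name; `remove_model` is a hypothetical function, not a LocalAI command.

```shell
# remove_model NAME [MODELS_DIR]: delete NAME.gguf and NAME.yaml if present
# (MODELS_DIR defaults to $MODELS_PATH, else ./models).
remove_model() (
  name=${1:?model name required}
  dir=${2:-${MODELS_PATH:-./models}}
  # Reject names containing a slash so the path cannot leave MODELS_DIR.
  case "$name" in
    */*) echo "invalid name: $name" >&2; exit 1 ;;
  esac
  for f in "$dir/$name.gguf" "$dir/$name.yaml"; do
    if [ -f "$f" ]; then
      rm -- "$f" && echo "removed $f"
    fi
  done
)

# Example:
# remove_model model-name models
```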
Troubleshooting
Model Not Loading
- Check backend: Ensure the required backend is installed
  local-ai backends list
  local-ai backends install llama-cpp # if needed
- Check logs: Enable debug mode
  DEBUG=true local-ai
- Verify file: Ensure the model file is not corrupted
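One quick sanity check for the last point is the file header: GGUF files begin with the four ASCII bytes `GGUF`. The sketch below tests only that; `is_gguf` is a hypothetical helper. It catches common failures such as an HTML error page saved under a `.gguf` name, but it cannot detect a download that was truncated partway through.

```shell
# is_gguf FILE: succeed if FILE exists and starts with the magic "GGUF".
is_gguf() {
  # tr drops NUL bytes so the shell does not warn about binary input.
  [ -f "$1" ] && [ "$(head -c 4 "$1" | tr -d '\0')" = "GGUF" ]
}

# Example:
# is_gguf models/phi-2.Q4_K_M.gguf || echo "not a GGUF file, download again"
```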
Out of Memory
- Use a smaller quantization (Q4_K_S or Q2_K)
- Reduce context_size in configuration
- Close other applications to free RAM
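Lowering the context size means editing the model's YAML file. The sketch below rewrites a top-level `context_size` line in place; `set_context_size` is a hypothetical helper and assumes the key already exists at the start of a line, as in the configuration examples in this guide.

```shell
# set_context_size FILE N: replace the top-level context_size value in FILE.
set_context_size() (
  file=${1:?config file required}
  size=${2:?context size required}
  case "$size" in
    ''|*[!0-9]*) echo "not a number: $size" >&2; exit 1 ;;
  esac
  grep -q '^context_size:' "$file" || { echo "no context_size in $file" >&2; exit 1; }
  tmp=$(mktemp) || exit 1
  # Write to a temporary file first, then copy back to keep permissions.
  sed "s/^context_size:.*/context_size: $size/" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  exit "$status"
)

# Example:
# set_context_size models/phi-2.yaml 1024
```

Restart LocalAI or reload the model afterwards so the new value takes effect.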
Wrong Backend
Check the [Compatibility Table]({{% relref "reference/compatibility-table" %}}) to ensure you're using the correct backend for your model.
Best Practices
- Start small: Begin with smaller models to test your setup
- Use quantized models: Q4_K_M is a good balance for most use cases
- Organize models: Keep your models directory organized
- Backup configurations: Save your YAML configurations
- Monitor resources: Watch RAM and disk usage
