Compare commits


1 commit

Author: Ettore Di Giacinto
SHA1: 3e8a54f4b6
Message: chore(docs): improve
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Date: 2025-11-17 19:34:25 +01:00

17 changed files with 3275 additions and 27 deletions

View File

@@ -1,11 +1,38 @@
---
weight: 20
title: "Advanced"
description: "Advanced usage"
icon: settings
lead: ""
date: 2020-10-06T08:49:15+00:00
lastmod: 2020-10-06T08:49:15+00:00
draft: false
images: []
---
+++
disableToc = false
title = "Advanced Configuration"
weight = 20
icon = "settings"
description = "Advanced configuration and optimization for LocalAI"
+++
This section covers advanced configuration, optimization, and fine-tuning options for LocalAI.
## Configuration
- **[Model Configuration]({{% relref "docs/advanced/model-configuration" %}})** - Complete model configuration reference
- **[Advanced Usage]({{% relref "docs/advanced/advanced-usage" %}})** - Advanced configuration options
- **[Installer Options]({{% relref "docs/advanced/installer" %}})** - Installer configuration and options
## Performance & Optimization
- **[Performance Tuning]({{% relref "docs/advanced/performance-tuning" %}})** - Optimize for maximum performance
- **[VRAM Management]({{% relref "docs/advanced/vram-management" %}})** - Manage GPU memory efficiently
## Specialized Topics
- **[Fine-tuning]({{% relref "docs/advanced/fine-tuning" %}})** - Fine-tune models for LocalAI
## Before You Begin
Make sure you have:
- LocalAI installed and running
- Basic understanding of YAML configuration
- Familiarity with your system's resources
## Related Documentation
- [Getting Started]({{% relref "docs/getting-started" %}}) - Installation and basics
- [Model Configuration]({{% relref "docs/advanced/model-configuration" %}}) - Configuration reference
- [Troubleshooting]({{% relref "docs/troubleshooting" %}}) - Common issues
- [Performance Tuning]({{% relref "docs/advanced/performance-tuning" %}}) - Optimization guide

View File

@@ -0,0 +1,344 @@
+++
disableToc = false
title = "Performance Tuning"
weight = 22
icon = "speed"
description = "Optimize LocalAI for maximum performance"
+++
This guide covers techniques to optimize LocalAI performance for your specific hardware and use case.
## Performance Metrics
Before optimizing, establish baseline metrics:
- **Tokens per second**: Measure inference speed
- **Memory usage**: Monitor RAM and VRAM
- **Latency**: Time to first token and total response time
- **Throughput**: Requests per second
Enable debug mode to see performance stats:
```bash
DEBUG=true local-ai
```
Look for output like:
```
llm_load_tensors: tok/s: 45.23
```
## CPU Optimization
### Thread Configuration
Match threads to CPU cores:
```yaml
# Model configuration
threads: 4 # For 4-core CPU
```
**Guidelines**:
- Use number of physical cores (not hyperthreads)
- Leave 1-2 cores for system
- Too many threads can hurt performance
### CPU Instructions
Enable appropriate CPU instructions:
```bash
# Check available instructions
cat /proc/cpuinfo | grep flags
# Build with optimizations
CMAKE_ARGS="-DGGML_AVX2=ON -DGGML_AVX512=ON" make build
```
### NUMA Optimization
For multi-socket systems:
```yaml
numa: true
```
### Memory Mapping
Enable memory mapping for faster model loading:
```yaml
mmap: true
mmlock: false # Set to true to lock in memory (faster but uses more RAM)
```
## GPU Optimization
### Layer Offloading
Offload as many layers as GPU memory allows:
```yaml
gpu_layers: 35 # Adjust based on GPU memory
f16: true # Use FP16 for better performance
```
**Finding optimal layers**:
1. Start with 20 layers
2. Monitor GPU memory: `nvidia-smi` or `rocm-smi` (a monitoring sketch follows this list)
3. Gradually increase until near memory limit
4. For maximum performance, offload all layers if possible
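To make the monitoring step easier to script, here is a small helper sketch. It is not part of LocalAI and assumes an NVIDIA GPU with `nvidia-smi` on the PATH; it simply reports VRAM usage so you can check headroom after each `gpu_layers` change:
```python
# Check GPU memory headroom while tuning gpu_layers.
# Assumes an NVIDIA GPU with nvidia-smi available; adapt for rocm-smi on AMD.
import subprocess

def gpu_memory_usage():
    """Return (used_mib, total_mib) for the first GPU reported by nvidia-smi."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    used, total = out.splitlines()[0].split(", ")
    return int(used), int(total)

if __name__ == "__main__":
    used, total = gpu_memory_usage()
    print(f"VRAM: {used}/{total} MiB ({used / total:.0%} used)")
    if used / total > 0.9:
        print("Close to the limit - consider lowering gpu_layers.")
```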
### Batch Processing
GPUs excel at batch processing. Process multiple requests together when possible.
### Mixed Precision
Use FP16 when supported:
```yaml
f16: true
```
## Model Optimization
### Quantization
Choose appropriate quantization:
| Quantization | Speed | Quality | Memory | Use Case |
|-------------|-------|---------|--------|----------|
| Q8_0 | Slowest | Highest | Most | Maximum quality |
| Q6_K | Slow | Very High | High | High quality |
| Q4_K_M | Medium | High | Medium | **Recommended** |
| Q4_K_S | Fast | Medium | Low | Balanced |
| Q2_K | Fastest | Lower | Least | Speed priority |
### Context Size
Reduce context size for faster inference:
```yaml
context_size: 2048 # Instead of 4096 or 8192
```
**Trade-off**: Smaller context = faster but less conversation history
### Model Selection
Choose models appropriate for your hardware:
- **Small systems (4GB RAM)**: 1-3B parameter models
- **Medium systems (8-16GB RAM)**: 3-7B parameter models
- **Large systems (32GB+ RAM)**: 7B+ parameter models
## Configuration Optimizations
### Sampling Parameters
Optimize sampling for speed:
```yaml
parameters:
temperature: 0.7
top_p: 0.9
top_k: 40
mirostat: 0 # Disable for speed (enabled by default)
```
**Note**: Disabling mirostat improves speed but may reduce quality.
### Prompt Caching
Enable prompt caching for repeated queries:
```yaml
prompt_cache_path: "cache"
prompt_cache_all: true
```
### Parallel Requests
LocalAI supports parallel requests. Configure appropriately:
```yaml
# In model config
parallel_requests: 4 # Adjust based on hardware
```
## Storage Optimization
### Use SSD
Always use SSD for model storage:
- HDD: Very slow model loading
- SSD: Fast loading, better performance
### Disable MMAP on HDD
If stuck with HDD:
```yaml
mmap: false # Loads entire model into RAM
```
### Model Location
Store models on fastest storage:
- Local SSD: Best performance
- Network storage: Slower, but allows sharing
- External drive: Slowest
## System-Level Optimizations
### Process Priority
Increase process priority (Linux):
```bash
nice -n -10 local-ai
```
### CPU Governor
Set CPU to performance mode (Linux):
```bash
# Check current governor
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Set to performance
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```
### Disable Swapping
Prevent swapping for better performance:
```bash
# Linux
sudo swapoff -a
# Or set swappiness to 0
echo 0 | sudo tee /proc/sys/vm/swappiness
```
### Memory Allocation
For large models, consider huge pages (Linux):
```bash
# Allocate huge pages
echo 1024 | sudo tee /proc/sys/vm/nr_hugepages
```
## Benchmarking
### Measure Performance
Create a benchmark script:
```python
import time
import requests
start = time.time()
response = requests.post(
"http://localhost:8080/v1/chat/completions",
json={
"model": "gpt-4",
"messages": [{"role": "user", "content": "Hello"}]
}
)
elapsed = time.time() - start
tokens = response.json()["usage"]["completion_tokens"]
tokens_per_second = tokens / elapsed
print(f"Time: {elapsed:.2f}s")
print(f"Tokens: {tokens}")
print(f"Speed: {tokens_per_second:.2f} tok/s")
```
### Compare Configurations
Test different configurations:
1. Baseline: Default settings
2. Optimized: Your optimizations
3. Measure: Tokens/second, latency, memory
### Load Testing
Test under load:
```bash
# Use Apache Bench or similar
ab -n 100 -c 10 -p request.json -T application/json \
http://localhost:8080/v1/chat/completions
```
## Platform-Specific Tips
### Apple Silicon
- Metal acceleration is automatic
- Use native builds (not Docker) for best performance
- M1/M2/M3 have unified memory - optimize accordingly
### NVIDIA GPUs
- Use CUDA 12 for latest optimizations
- Enable Tensor Cores with appropriate precision
- Monitor with `nvidia-smi` for bottlenecks
### AMD GPUs
- Use ROCm/HIPBLAS backend
- Check ROCm compatibility
- Monitor with `rocm-smi`
### Intel GPUs
- Use oneAPI/SYCL backend
- Check Intel GPU compatibility
- Optimize for F16/F32 precision
## Common Performance Issues
### Slow First Response
**Cause**: Model loading
**Solution**: Pre-load models or use model warming
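One way to warm a model is to send a short throwaway request right after startup so the first real user does not pay the model-loading cost. A minimal sketch, reusing the `gpt-4` model name from the examples above:
```python
# Warm up a model by issuing one tiny request after LocalAI starts.
import requests

def warm_up(base_url="http://localhost:8080", model="gpt-4"):
    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": "ping"}],
            "max_tokens": 1,
        },
        timeout=300,  # the first load can take a while
    )
    resp.raise_for_status()
    print(f"{model} is loaded and ready")

if __name__ == "__main__":
    warm_up()
```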
### Degrading Performance
**Cause**: Memory fragmentation
**Solution**: Restart LocalAI periodically
### Inconsistent Speed
**Cause**: System load, thermal throttling
**Solution**: Monitor system resources, ensure cooling
## Performance Checklist
- [ ] Threads match CPU cores
- [ ] GPU layers optimized
- [ ] Appropriate quantization selected
- [ ] Context size optimized
- [ ] Models on SSD
- [ ] MMAP enabled (if using SSD)
- [ ] Mirostat disabled (if speed priority)
- [ ] System resources monitored
- [ ] Baseline metrics established
- [ ] Optimizations tested and verified
## See Also
- [GPU Acceleration]({{% relref "docs/features/gpu-acceleration" %}}) - GPU setup
- [VRAM Management]({{% relref "docs/advanced/vram-management" %}}) - GPU memory
- [Model Configuration]({{% relref "docs/advanced/model-configuration" %}}) - Configuration options
- [Troubleshooting]({{% relref "docs/troubleshooting" %}}) - Performance issues

View File

@@ -14,7 +14,31 @@ Here are answers to some of the most common questions.
### How do I get models?
Most gguf-based models should work, but newer models may require additions to the API. If a model doesn't work, please feel free to open issues. However, be cautious about downloading models from the internet directly onto your machine, as there may be security vulnerabilities in llama.cpp or ggml that could be maliciously exploited. Some models can be found on Hugging Face: https://huggingface.co/models?search=gguf, or models from gpt4all are compatible too: https://github.com/nomic-ai/gpt4all.
There are several ways to get models for LocalAI:
1. **WebUI Import** (Easiest): Use the WebUI's model import interface:
- Open `http://localhost:8080` and navigate to the Models tab
- Click "Import Model" or "New Model"
- Enter a model URI (Hugging Face, OCI, file path, etc.)
- Configure preferences in Simple Mode or edit YAML in Advanced Mode
- The WebUI provides syntax highlighting, validation, and a user-friendly interface
2. **Model Gallery** (Recommended): Use the built-in model gallery accessible via:
- WebUI: Navigate to the Models tab in the LocalAI interface and browse available models
- CLI: `local-ai models list` to see available models, then `local-ai models install <model-name>`
- Online: Browse models at [models.localai.io](https://models.localai.io)
3. **Hugging Face**: Most GGUF-based models from Hugging Face work with LocalAI. You can install them via:
- WebUI: Import using `huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf`
- CLI: `local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf`
4. **Manual Installation**: Download model files and place them in your models directory. See [Install and Run Models]({{% relref "docs/getting-started/models" %}}) for details.
5. **OCI Registries**: Install models from OCI-compatible registries:
- WebUI: Import using `ollama://gemma:2b` or `oci://localai/phi-2:latest`
- CLI: `local-ai run ollama://gemma:2b` or `local-ai run oci://localai/phi-2:latest`
**Security Note**: Be cautious when downloading models from the internet. Always verify the source and use trusted repositories when possible.
### Where are models stored?
@@ -70,7 +94,15 @@ There is GPU support, see {{%relref "docs/features/GPU-acceleration" %}}.
### Where is the webUI?
localai-webui and chatbot-ui are available in the examples section and can be set up as per the instructions. However, since LocalAI is an API, you can also plug it into existing projects that provide UI frontends for OpenAI's APIs. There are several on GitHub, and they should already be compatible with LocalAI (as it mimics the OpenAI API).
LocalAI includes a built-in WebUI that is automatically available when you start LocalAI. Simply navigate to `http://localhost:8080` in your web browser after starting LocalAI.
The WebUI provides:
- Chat interface for interacting with models
- Model gallery browser and installer
- Backend management
- Configuration tools
If you prefer a different interface, LocalAI is compatible with any OpenAI-compatible UI. You can find examples in the [LocalAI-examples repository](https://github.com/mudler/LocalAI-examples), including integrations with popular UIs like chatbot-ui.
### Does it work with AutoGPT?
@@ -88,3 +120,96 @@ This typically happens when your prompt exceeds the context size. Try to reduce
### I'm getting a 'SIGILL' error, what's wrong?
Your CPU probably does not have support for certain instructions that are compiled by default in the pre-built binaries. If you are running in a container, try setting `REBUILD=true` and disable the CPU instructions that are not compatible with your CPU. For instance: `CMAKE_ARGS="-DGGML_F16C=OFF -DGGML_AVX512=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF" make build`
Alternatively, you can use the backend management system to install a compatible backend for your CPU architecture. See [Backend Management]({{% relref "docs/features/backends" %}}) for more information.
### How do I install backends?
LocalAI now uses a backend management system where backends are automatically downloaded when needed. You can also manually install backends:
```bash
# List available backends
local-ai backends list
# Install a specific backend
local-ai backends install llama-cpp
# Install a backend for a specific GPU type
local-ai backends install llama-cpp --gpu-type nvidia
```
For more details, see the [Backends documentation]({{% relref "docs/features/backends" %}}).
### How do I set up API keys for security?
You can secure your LocalAI instance by setting API keys using the `API_KEY` environment variable:
```bash
# Single API key
API_KEY=your-secret-key local-ai
# Multiple API keys (comma-separated)
API_KEY=key1,key2,key3 local-ai
```
When API keys are set, all requests must include the key in the `Authorization` header:
```bash
curl http://localhost:8080/v1/models \
-H "Authorization: Bearer your-secret-key"
```
**Important**: API keys provide full access to all LocalAI features (admin-level access). Make sure to protect your API keys and use HTTPS when exposing LocalAI remotely.
### My model is not loading or showing errors
Here are common issues and solutions:
1. **Backend not installed**: The required backend may not be installed. Check with `local-ai backends list` and install if needed.
2. **Insufficient memory**: Large models require significant RAM. Check available memory and consider using a smaller quantized model.
3. **Wrong backend specified**: Ensure the backend in your model configuration matches the model type. See the [Compatibility Table]({{% relref "docs/reference/compatibility-table" %}}).
4. **Model file corruption**: Re-download the model file.
5. **Check logs**: Enable debug mode (`DEBUG=true`) to see detailed error messages.
For more troubleshooting help, see the [Troubleshooting Guide]({{% relref "docs/troubleshooting" %}}).
### How do I use GPU acceleration?
LocalAI supports multiple GPU types:
- **NVIDIA (CUDA)**: Use `--gpus all` with Docker and CUDA-enabled images
- **AMD (ROCm)**: Use images with `hipblas` tag
- **Intel**: Use images with `intel` tag or Intel oneAPI
- **Apple Silicon (Metal)**: Automatically detected on macOS
For detailed setup instructions, see [GPU Acceleration]({{% relref "docs/features/gpu-acceleration" %}}).
### Can I use LocalAI with LangChain, AutoGPT, or other frameworks?
Yes! LocalAI is compatible with any framework that supports OpenAI's API. Simply point the framework to your LocalAI endpoint:
```python
# Example with LangChain
from langchain.llms import OpenAI
llm = OpenAI(
openai_api_key="not-needed",
openai_api_base="http://localhost:8080/v1"
)
```
See the [Integrations]({{% relref "docs/integrations" %}}) page for a list of compatible projects and examples.
### What's the difference between AIO images and standard images?
**AIO (All-in-One) images** come pre-configured with:
- Pre-installed models ready to use
- All necessary backends included
- Quick start with no configuration needed
**Standard images** are:
- Smaller in size
- No pre-installed models
- You install models and backends as needed
- More flexible for custom setups
Choose AIO images for quick testing and standard images for production deployments. See [Container Images]({{% relref "docs/getting-started/container-images" %}}) for details.

View File

@@ -1,8 +1,56 @@
+++
disableToc = false
title = "Features"
weight = 8
icon = "feature_search"
url = "/features/"
description = "Explore all LocalAI capabilities and features"
+++
LocalAI provides a comprehensive set of AI capabilities, all running locally with OpenAI-compatible APIs.
## Core Features
### Text Generation
- **[Text Generation]({{% relref "docs/features/text-generation" %}})** - Generate text with various LLMs
- **[OpenAI Functions]({{% relref "docs/features/openai-functions" %}})** - Function calling and tools API
- **[Constrained Grammars]({{% relref "docs/features/constrained_grammars" %}})** - Structured output generation
- **[Model Context Protocol (MCP)]({{% relref "docs/features/mcp" %}})** - Agentic capabilities
### Multimodal
- **[GPT Vision]({{% relref "docs/features/gpt-vision" %}})** - Image understanding and analysis
- **[Image Generation]({{% relref "docs/features/image-generation" %}})** - Create images from text
- **[Object Detection]({{% relref "docs/features/object-detection" %}})** - Detect objects in images
### Audio
- **[Text to Audio]({{% relref "docs/features/text-to-audio" %}})** - Generate speech from text
- **[Audio to Text]({{% relref "docs/features/audio-to-text" %}})** - Transcribe audio to text
### Data & Search
- **[Embeddings]({{% relref "docs/features/embeddings" %}})** - Generate vector embeddings
- **[Reranker]({{% relref "docs/features/reranker" %}})** - Document relevance scoring
- **[Stores]({{% relref "docs/features/stores" %}})** - Vector database storage
## Infrastructure
- **[Backends]({{% relref "docs/features/backends" %}})** - Backend management and installation
- **[GPU Acceleration]({{% relref "docs/features/gpu-acceleration" %}})** - GPU support and optimization
- **[Model Gallery]({{% relref "docs/features/model-gallery" %}})** - Browse and install models
- **[Distributed Inferencing]({{% relref "docs/features/distributed_inferencing" %}})** - P2P and distributed inference
## Getting Started with Features
1. **Install LocalAI**: See [Getting Started]({{% relref "docs/getting-started" %}})
2. **Install Models**: See [Setting Up Models]({{% relref "docs/tutorials/setting-up-models" %}})
3. **Try Features**: See [Try It Out]({{% relref "docs/getting-started/try-it-out" %}})
4. **Configure**: See [Advanced Configuration]({{% relref "docs/advanced" %}})
## Related Documentation
- [API Reference]({{% relref "docs/reference/api-reference" %}}) - Complete API documentation
- [Compatibility Table]({{% relref "docs/reference/compatibility-table" %}}) - Supported models and backends
- [Tutorials]({{% relref "docs/tutorials" %}}) - Step-by-step guides

View File

@@ -1,7 +1,49 @@
+++
disableToc = false
title = "Getting started"
title = "Getting Started"
weight = 2
icon = "rocket_launch"
description = "Install LocalAI and run your first AI model"
+++
Welcome to LocalAI! This section will guide you through installation and your first steps.
## Quick Start
**New to LocalAI?** Start here:
1. **[Quickstart]({{% relref "docs/getting-started/quickstart" %}})** - Get LocalAI running in minutes
2. **[Your First Chat]({{% relref "docs/tutorials/first-chat" %}})** - Complete beginner tutorial
3. **[Try It Out]({{% relref "docs/getting-started/try-it-out" %}})** - Test the API with examples
## Installation Options
Choose the installation method that works for you:
- **[Quickstart]({{% relref "docs/getting-started/quickstart" %}})** - Docker, installer, or binaries
- **[Container Images]({{% relref "docs/getting-started/container-images" %}})** - Docker deployment options
- **[Build from Source]({{% relref "docs/getting-started/build" %}})** - Compile LocalAI yourself
- **[Kubernetes]({{% relref "docs/getting-started/kubernetes" %}})** - Deploy on Kubernetes
## Setting Up Models
Once LocalAI is installed:
- **[Install and Run Models]({{% relref "docs/getting-started/models" %}})** - Model installation guide
- **[Setting Up Models Tutorial]({{% relref "docs/tutorials/setting-up-models" %}})** - Step-by-step model setup
- **[Customize Models]({{% relref "docs/getting-started/customize-model" %}})** - Configure model behavior
## What's Next?
After installation:
- Explore [Features]({{% relref "docs/features" %}}) - See what LocalAI can do
- Follow [Tutorials]({{% relref "docs/tutorials" %}}) - Learn step-by-step
- Check [FAQ]({{% relref "docs/faq" %}}) - Common questions
- Read [Documentation]({{% relref "docs" %}}) - Complete reference
## Need Help?
- [FAQ]({{% relref "docs/faq" %}}) - Common questions and answers
- [Troubleshooting]({{% relref "docs/troubleshooting" %}}) - Solutions to problems
- [Discord](https://discord.gg/uJAeKSAGDy) - Community support

View File

@@ -7,6 +7,7 @@ icon = "rocket_launch"
To install models with LocalAI, you can:
- **Import via WebUI** (Recommended for beginners): Use the WebUI's model import interface to import models from URIs with a user-friendly interface. Supports both simple mode (with preferences) and advanced mode (YAML editor). See the [Setting Up Models tutorial]({{% relref "docs/tutorials/setting-up-models" %}}) for details.
- Browse the Model Gallery from the Web Interface and install models with a couple of clicks. For more details, refer to the [Gallery Documentation]({{% relref "docs/features/model-gallery" %}}).
- Specify a model from the LocalAI gallery during startup, e.g., `local-ai run <model_gallery_name>`.
- Use a URI to specify a model file (e.g., `huggingface://...`, `oci://`, or `ollama://`) when starting LocalAI, e.g., `local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf`.
@@ -31,9 +32,29 @@ local-ai models install hermes-2-theta-llama-3-8b
Note: The galleries available in LocalAI can be customized to point to a different URL or a local directory. For more information on how to setup your own gallery, see the [Gallery Documentation]({{% relref "docs/features/model-gallery" %}}).
## Run Models via URI
## Import Models via WebUI
To run models via URI, specify a URI to a model file or a configuration file when starting LocalAI. Valid syntax includes:
The easiest way to import models is through the WebUI's import interface:
1. Open the LocalAI WebUI at `http://localhost:8080`
2. Navigate to the "Models" tab
3. Click "Import Model" or "New Model"
4. Choose your import method:
- **Simple Mode**: Enter a model URI and configure preferences (backend, name, description, quantizations, etc.)
- **Advanced Mode**: Edit YAML configuration directly with syntax highlighting and validation
The WebUI import supports all URI types:
- `huggingface://repository_id/model_file`
- `oci://container_image:tag`
- `ollama://model_id:tag`
- `file://path/to/model`
- `https://...` (for configuration files)
For detailed instructions, see the [Setting Up Models tutorial]({{% relref "docs/tutorials/setting-up-models" %}}).
## Run Models via URI (CLI)
To run models via URI from the command line, specify a URI to a model file or a configuration file when starting LocalAI. Valid syntax includes:
- `file://path/to/model`
- `huggingface://repository_id/model_file` (e.g., `huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf`)
@@ -172,7 +193,7 @@ curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d
{{% alert icon="💡" %}}
**Other Docker Images**:
For other Docker images, please refer to the table in [Getting Started](https://localai.io/basics/getting_started/#container-images).
For other Docker images, please refer to the table in [Container Images]({{% relref "docs/getting-started/container-images" %}}).
{{% /alert %}}
Note: If you are on Windows, ensure the project is on the Linux filesystem to avoid slow model loading. For more information, see the [Microsoft Docs](https://learn.microsoft.com/en-us/windows/wsl/filesystems).

View File

@@ -70,7 +70,7 @@ You can use Docker for a quick start:
docker run -p 8080:8080 --name local-ai -ti localai/localai:latest-aio-cpu
```
For more detailed installation options and configurations, see our [Getting Started guide](/basics/getting_started/).
For more detailed installation options and configurations, see our [Getting Started guide]({{% relref "docs/getting-started/quickstart" %}}).
## One-liner
@@ -104,9 +104,9 @@ LocalAI is a community-driven project. You can:
Ready to dive in? Here are some recommended next steps:
1. [Install LocalAI](/basics/getting_started/)
1. [Install LocalAI]({{% relref "docs/getting-started/quickstart" %}})
2. [Explore available models](https://models.localai.io)
3. [Model compatibility](/model-compatibility/)
3. [Model compatibility]({{% relref "docs/reference/compatibility-table" %}})
4. [Try out examples](https://github.com/mudler/LocalAI-examples)
5. [Join the community](https://discord.gg/uJAeKSAGDy)
6. [Check the LocalAI Github repository](https://github.com/mudler/LocalAI)

View File

@@ -0,0 +1,445 @@
+++
disableToc = false
title = "API Reference"
weight = 22
icon = "api"
description = "Complete API reference for LocalAI's OpenAI-compatible endpoints"
+++
LocalAI provides a REST API that is compatible with OpenAI's API specification. This document provides a complete reference for all available endpoints.
## Base URL
All API requests should be made to:
```
http://localhost:8080/v1
```
For production deployments, replace `localhost:8080` with your server's address.
## Authentication
If API keys are configured (via `API_KEY` environment variable), include the key in the `Authorization` header:
```bash
Authorization: Bearer your-api-key
```
## Endpoints
### Chat Completions
Create a model response for the given chat conversation.
**Endpoint**: `POST /v1/chat/completions`
**Request Body**:
```json
{
"model": "gpt-4",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
],
"temperature": 0.7,
"max_tokens": 100,
"top_p": 1.0,
"top_k": 40,
"stream": false
}
```
**Parameters**:
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `model` | string | The model to use | Required |
| `messages` | array | Array of message objects | Required |
| `temperature` | number | Sampling temperature (0-2) | 0.7 |
| `max_tokens` | integer | Maximum tokens to generate | Model default |
| `top_p` | number | Nucleus sampling parameter | 1.0 |
| `top_k` | integer | Top-k sampling parameter | 40 |
| `stream` | boolean | Stream responses | false |
| `tools` | array | Available tools/functions | - |
| `tool_choice` | string | Tool selection mode | "auto" |
**Response**:
```json
{
"id": "chatcmpl-123",
"object": "chat.completion",
"created": 1677652288,
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help you today?"
},
"finish_reason": "stop"
}],
"usage": {
"prompt_tokens": 9,
"completion_tokens": 12,
"total_tokens": 21
}
}
```
**Example**:
```bash
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Hello!"}]
}'
```
### Completions
Create a completion for the provided prompt.
**Endpoint**: `POST /v1/completions`
**Request Body**:
```json
{
"model": "gpt-4",
"prompt": "The capital of France is",
"temperature": 0.7,
"max_tokens": 10
}
```
**Parameters**:
| Parameter | Type | Description |
|-----------|------|-------------|
| `model` | string | The model to use |
| `prompt` | string | The prompt to complete |
| `temperature` | number | Sampling temperature |
| `max_tokens` | integer | Maximum tokens to generate |
| `top_p` | number | Nucleus sampling |
| `top_k` | integer | Top-k sampling |
| `stream` | boolean | Stream responses |
**Example**:
```bash
curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"prompt": "The capital of France is",
"max_tokens": 10
}'
```
### Edits
Create an edited version of the input.
**Endpoint**: `POST /v1/edits`
**Request Body**:
```json
{
"model": "gpt-4",
"instruction": "Make it more formal",
"input": "Hey, how are you?",
"temperature": 0.7
}
```
**Example**:
```bash
curl http://localhost:8080/v1/edits \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"instruction": "Make it more formal",
"input": "Hey, how are you?"
}'
```
### Embeddings
Get a vector representation of input text.
**Endpoint**: `POST /v1/embeddings`
**Request Body**:
```json
{
"model": "text-embedding-ada-002",
"input": "The food was delicious"
}
```
**Response**:
```json
{
"object": "list",
"data": [{
"object": "embedding",
"embedding": [0.1, 0.2, 0.3, ...],
"index": 0
}],
"usage": {
"prompt_tokens": 4,
"total_tokens": 4
}
}
```
**Example**:
```bash
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "text-embedding-ada-002",
"input": "The food was delicious"
}'
```
### Audio Transcription
Transcribe audio into the input language.
**Endpoint**: `POST /v1/audio/transcriptions`
**Request**: `multipart/form-data`
**Form Fields**:
| Field | Type | Description |
|-------|------|-------------|
| `file` | file | Audio file to transcribe |
| `model` | string | Model to use (e.g., "whisper-1") |
| `language` | string | Language code (optional) |
| `prompt` | string | Optional text prompt |
| `response_format` | string | Response format (json, text, etc.) |
**Example**:
```bash
curl http://localhost:8080/v1/audio/transcriptions \
-H "Authorization: Bearer not-needed" \
-F file="@audio.mp3" \
-F model="whisper-1"
```
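The same upload can be made from Python with a standard multipart request. This sketch assumes a local `audio.mp3` and the default JSON response containing a `text` field:
```python
# Transcribe a local audio file via the OpenAI-compatible endpoint.
import requests

with open("audio.mp3", "rb") as f:
    resp = requests.post(
        "http://localhost:8080/v1/audio/transcriptions",
        files={"file": ("audio.mp3", f, "audio/mpeg")},
        data={"model": "whisper-1"},
    )
resp.raise_for_status()
print(resp.json()["text"])
```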
### Audio Speech (Text-to-Speech)
Generate audio from text.
**Endpoint**: `POST /v1/audio/speech`
**Request Body**:
```json
{
"model": "tts-1",
"input": "Hello, this is a test",
"voice": "alloy",
"response_format": "mp3"
}
```
**Parameters**:
| Parameter | Type | Description |
|-----------|------|-------------|
| `model` | string | TTS model to use |
| `input` | string | Text to convert to speech |
| `voice` | string | Voice to use (alloy, echo, fable, etc.) |
| `response_format` | string | Audio format (mp3, opus, etc.) |
**Example**:
```bash
curl http://localhost:8080/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "tts-1",
"input": "Hello, this is a test",
"voice": "alloy"
}' \
--output speech.mp3
```
### Image Generation
Generate images from text prompts.
**Endpoint**: `POST /v1/images/generations`
**Request Body**:
```json
{
"prompt": "A cute baby sea otter",
"n": 1,
"size": "256x256",
"response_format": "url"
}
```
**Parameters**:
| Parameter | Type | Description |
|-----------|------|-------------|
| `prompt` | string | Text description of the image |
| `n` | integer | Number of images to generate |
| `size` | string | Image size (256x256, 512x512, etc.) |
| `response_format` | string | Response format (url, b64_json) |
**Example**:
```bash
curl http://localhost:8080/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"prompt": "A cute baby sea otter",
"size": "256x256"
}'
```
### List Models
List all available models.
**Endpoint**: `GET /v1/models`
**Query Parameters**:
| Parameter | Type | Description |
|-----------|------|-------------|
| `filter` | string | Filter models by name |
| `excludeConfigured` | boolean | Exclude configured models |
**Response**:
```json
{
"object": "list",
"data": [
{
"id": "gpt-4",
"object": "model"
},
{
"id": "gpt-4-vision-preview",
"object": "model"
}
]
}
```
**Example**:
```bash
curl http://localhost:8080/v1/models
```
## Streaming Responses
Many endpoints support streaming. Set `"stream": true` in the request:
```bash
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'
```
Stream responses are sent as Server-Sent Events (SSE):
```
data: {"id":"chatcmpl-123","object":"chat.completion.chunk",...}
data: {"id":"chatcmpl-123","object":"chat.completion.chunk",...}
data: [DONE]
```
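From Python, one way to consume the stream is to read the response line by line, parse each `data:` event, and stop at the `[DONE]` sentinel. A minimal sketch, assuming the OpenAI-style `delta` chunk format:
```python
# Consume a streamed chat completion as Server-Sent Events.
import json
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "gpt-4",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
    },
    stream=True,  # keep the connection open and iterate over chunks
)
resp.raise_for_status()

for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: "):
        continue  # skip keep-alive blank lines
    payload = line[len("data: "):]
    if payload == "[DONE]":
        break
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"].get("content", "")
    print(delta, end="", flush=True)
print()
```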
## Error Handling
### Error Response Format
```json
{
"error": {
"message": "Error description",
"type": "invalid_request_error",
"code": 400
}
}
```
### Common Error Codes
| Code | Description |
|------|-------------|
| 400 | Bad Request - Invalid parameters |
| 401 | Unauthorized - Missing or invalid API key |
| 404 | Not Found - Model or endpoint not found |
| 429 | Too Many Requests - Rate limit exceeded |
| 500 | Internal Server Error - Server error |
| 503 | Service Unavailable - Model not loaded |
### Example Error Handling
```python
import requests
try:
response = requests.post(
"http://localhost:8080/v1/chat/completions",
json={"model": "gpt-4", "messages": [...]},
timeout=30
)
response.raise_for_status()
data = response.json()
except requests.exceptions.HTTPError as e:
if e.response.status_code == 404:
print("Model not found")
elif e.response.status_code == 503:
print("Model not loaded")
else:
print(f"Error: {e}")
```
## Rate Limiting
LocalAI doesn't enforce rate limiting by default. For production deployments, implement rate limiting at the reverse proxy or application level.
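If requests go through your own application code rather than a proxy, a simple client-side throttle can act as a first line of defense. A minimal sketch (the class name and limits are illustrative, not LocalAI settings):
```python
# Very small client-side rate limiter: at most max_per_minute requests.
import threading
import time

import requests

class RateLimitedClient:
    def __init__(self, base_url, max_per_minute=60):
        self.base_url = base_url
        self.min_interval = 60.0 / max_per_minute
        self._lock = threading.Lock()
        self._last_call = 0.0

    def chat(self, payload):
        with self._lock:
            wait = self.min_interval - (time.monotonic() - self._last_call)
            if wait > 0:
                time.sleep(wait)  # throttle before sending
            self._last_call = time.monotonic()
        resp = requests.post(f"{self.base_url}/v1/chat/completions", json=payload)
        resp.raise_for_status()
        return resp.json()

client = RateLimitedClient("http://localhost:8080", max_per_minute=30)
```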
## Best Practices
1. **Use appropriate timeouts**: Set reasonable timeouts for requests
2. **Handle errors gracefully**: Implement retry logic with exponential backoff (see the sketch after this list)
3. **Monitor token usage**: Track `usage` fields in responses
4. **Use streaming for long responses**: Enable streaming for better user experience
5. **Cache embeddings**: Cache embedding results when possible
6. **Batch requests**: Process multiple items together when possible
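A minimal retry-with-backoff sketch for practice 2 (the endpoint and status codes mirror the error table above; the helper name is illustrative):
```python
# Retry a request with exponential backoff on transient errors.
import time

import requests

def chat_with_retry(payload, retries=5, base_delay=1.0):
    for attempt in range(retries):
        try:
            resp = requests.post(
                "http://localhost:8080/v1/chat/completions",
                json=payload,
                timeout=60,
            )
            if resp.status_code in (429, 503):  # busy or model still loading
                raise requests.exceptions.HTTPError(response=resp)
            resp.raise_for_status()
            return resp.json()
        except (requests.exceptions.ConnectionError,
                requests.exceptions.Timeout,
                requests.exceptions.HTTPError):
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```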
## See Also
- [OpenAI API Documentation](https://platform.openai.com/docs/api-reference) - Original OpenAI API reference
- [Try It Out]({{% relref "docs/getting-started/try-it-out" %}}) - Interactive examples
- [Integration Examples]({{% relref "docs/tutorials/integration-examples" %}}) - Framework integrations
- [Troubleshooting]({{% relref "docs/troubleshooting" %}}) - API issues

View File

@@ -0,0 +1,318 @@
+++
disableToc = false
title = "Security Best Practices"
weight = 26
icon = "security"
description = "Security guidelines for deploying LocalAI"
+++
This guide covers security best practices for deploying LocalAI in various environments, from local development to production.
## Overview
LocalAI processes sensitive data and may be exposed to networks. Follow these practices to secure your deployment.
## API Key Protection
### Always Use API Keys in Production
**Never expose LocalAI without API keys**:
```bash
# Set API key
API_KEY=your-secure-random-key local-ai
# Multiple keys (comma-separated)
API_KEY=key1,key2,key3 local-ai
```
### API Key Best Practices
1. **Generate strong keys**: Use cryptographically secure random strings
```bash
# Generate a secure key
openssl rand -hex 32
```
2. **Store securely**:
- Use environment variables
- Use secrets management (Kubernetes Secrets, HashiCorp Vault, etc.)
- Never commit keys to version control
3. **Rotate regularly**: Change API keys periodically
4. **Use different keys**: Different keys for different services/clients
5. **Limit key scope**: Consider implementing key-based rate limiting
### Using API Keys
Include the key in requests:
```bash
curl http://localhost:8080/v1/models \
-H "Authorization: Bearer your-api-key"
```
**Important**: API keys provide full access to all LocalAI features (admin-level). Protect them accordingly.
## Network Security
### Never Expose Directly to Internet
**Always use a reverse proxy** when exposing LocalAI:
```nginx
# nginx example
server {
listen 443 ssl;
server_name localai.example.com;
ssl_certificate /path/to/cert.pem;
ssl_certificate_key /path/to/key.pem;
location / {
proxy_pass http://localhost:8080;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
```
### Use HTTPS/TLS
**Always use HTTPS in production**:
1. Obtain SSL/TLS certificates (Let's Encrypt, etc.)
2. Configure reverse proxy with TLS
3. Enforce HTTPS redirects
4. Use strong cipher suites
### Firewall Configuration
Restrict access with firewall rules:
```bash
# Allow only specific IPs (example)
ufw allow from 192.168.1.0/24 to any port 8080
# Or use iptables
iptables -A INPUT -p tcp --dport 8080 -s 192.168.1.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 8080 -j DROP
```
### VPN or Private Network
For sensitive deployments:
- Use VPN for remote access
- Deploy on private network only
- Use network segmentation
## Model Security
### Model Source Verification
**Only use trusted model sources**:
1. **Official galleries**: Use LocalAI's model gallery
2. **Verified repositories**: Hugging Face verified models
3. **Verify checksums**: Check SHA256 hashes when provided
4. **Scan for malware**: Scan downloaded files
### Model Isolation
- Run models in isolated environments
- Use containers with limited permissions
- Separate model storage from system
### Model Access Control
- Restrict file system access to models
- Use appropriate file permissions
- Consider read-only model storage
## Container Security
### Use Non-Root User
Run containers as non-root:
```yaml
# Docker Compose
services:
localai:
user: "1000:1000" # Non-root UID/GID
```
### Limit Container Capabilities
```yaml
services:
localai:
cap_drop:
- ALL
cap_add:
- NET_BIND_SERVICE # Only what's needed
```
### Resource Limits
Set resource limits to prevent resource exhaustion:
```yaml
services:
localai:
deploy:
resources:
limits:
cpus: '4'
memory: 16G
```
### Read-Only Filesystem
Where possible, use read-only filesystem:
```yaml
services:
localai:
read_only: true
tmpfs:
- /tmp
- /var/run
```
## Input Validation
### Sanitize Inputs
Validate and sanitize all inputs (a minimal validation sketch follows this list):
- Check input length limits
- Validate data formats
- Sanitize user prompts
- Implement rate limiting
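As a starting point, the sketch below shows the kind of length and character checks an application layer might apply before forwarding user prompts to LocalAI (the limit and checks are illustrative, not LocalAI settings):
```python
# Basic prompt validation before forwarding a request to LocalAI.
MAX_PROMPT_CHARS = 8000  # illustrative limit; tune to your context size

def validate_prompt(prompt: str) -> str:
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt exceeds {MAX_PROMPT_CHARS} characters")
    # Drop control characters that have no place in a chat prompt
    return "".join(ch for ch in prompt if ch.isprintable() or ch in "\n\t")
```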
### File Upload Security
If accepting file uploads:
- Validate file types
- Limit file sizes
- Scan for malware
- Store in isolated location
## Logging and Monitoring
### Secure Logging
- Don't log sensitive data (API keys, user inputs)
- Use secure log storage
- Implement log rotation
- Monitor for suspicious activity
### Monitoring
Monitor for:
- Unusual API usage patterns
- Failed authentication attempts
- Resource exhaustion
- Error rate spikes
## Updates and Maintenance
### Keep Updated
- Regularly update LocalAI
- Update dependencies
- Patch security vulnerabilities
- Monitor security advisories
### Backup Security
- Encrypt backups
- Secure backup storage
- Test restore procedures
- Limit backup access
## Deployment-Specific Security
### Kubernetes
- Use NetworkPolicies
- Implement RBAC
- Use Secrets for sensitive data
- Enable Pod Security Policies
- Use service mesh for mTLS
### Docker
- Use official images
- Scan images for vulnerabilities
- Keep images updated
- Use Docker secrets
- Implement health checks
### Systemd
- Run as dedicated user
- Limit systemd service capabilities
- Use PrivateTmp, ProtectSystem
- Restrict network access
## Security Checklist
Before deploying to production:
- [ ] API keys configured and secured
- [ ] HTTPS/TLS enabled
- [ ] Reverse proxy configured
- [ ] Firewall rules set
- [ ] Network access restricted
- [ ] Container security hardened
- [ ] Resource limits configured
- [ ] Logging configured securely
- [ ] Monitoring in place
- [ ] Updates planned
- [ ] Backup security ensured
- [ ] Incident response plan ready
## Incident Response
### If Compromised
1. **Isolate**: Immediately disconnect from network
2. **Assess**: Determine scope of compromise
3. **Contain**: Prevent further damage
4. **Eradicate**: Remove threats
5. **Recover**: Restore from clean backups
6. **Learn**: Document and improve
### Security Contacts
- Report security issues: [GitHub Security](https://github.com/mudler/LocalAI/security)
- Security discussions: [Discord](https://discord.gg/uJAeKSAGDy)
## Compliance Considerations
### Data Privacy
- Understand data processing
- Implement data retention policies
- Consider GDPR, CCPA requirements
- Document data flows
### Audit Logging
- Log all API access
- Track model usage
- Monitor configuration changes
- Retain logs appropriately
## See Also
- [Deploying to Production]({{% relref "docs/tutorials/deploying-production" %}}) - Production deployment
- [API Reference]({{% relref "docs/reference/api-reference" %}}) - API security
- [Troubleshooting]({{% relref "docs/troubleshooting" %}}) - Security issues
- [FAQ]({{% relref "docs/faq" %}}) - Security questions

View File

@@ -0,0 +1,392 @@
+++
disableToc = false
title = "Troubleshooting Guide"
weight = 25
icon = "bug_report"
description = "Solutions to common problems and issues with LocalAI"
+++
This guide helps you diagnose and fix common issues with LocalAI. If you can't find a solution here, check the [FAQ]({{% relref "docs/faq" %}}) or ask for help on [Discord](https://discord.gg/uJAeKSAGDy).
## Getting Help
Before asking for help, gather this information:
1. **LocalAI version**: `local-ai --version` or check container image tag
2. **System information**: OS, CPU, RAM, GPU (if applicable)
3. **Error messages**: Full error output with `DEBUG=true`
4. **Configuration**: Relevant model configuration files
5. **Logs**: Enable debug mode and capture logs
## Common Issues
### Model Not Loading
**Symptoms**: Model appears in list but fails to load or respond
**Solutions**:
1. **Check backend installation**:
```bash
local-ai backends list
local-ai backends install <backend-name> # if missing
```
2. **Verify model file**:
- Check file exists and is not corrupted
- Verify file format (GGUF recommended)
- Re-download if corrupted
3. **Check memory**:
- Ensure sufficient RAM available
- Try smaller quantization (Q4_K_S instead of Q8_0)
- Reduce `context_size` in configuration
4. **Check logs**:
```bash
DEBUG=true local-ai
```
Look for specific error messages
5. **Verify backend compatibility**:
- Check [Compatibility Table]({{% relref "docs/reference/compatibility-table" %}})
- Ensure correct backend specified in model config
### Out of Memory Errors
**Symptoms**: Errors about memory, crashes, or very slow performance
**Solutions**:
1. **Reduce model size**:
- Use smaller quantization (Q2_K, Q4_K_S)
- Use smaller models (1-3B instead of 7B+)
2. **Adjust configuration**:
```yaml
context_size: 1024 # Reduce from default
gpu_layers: 20 # Reduce GPU layers if using GPU
```
3. **Free system memory**:
- Close other applications
- Reduce number of loaded models
- Use `--single-active-backend` flag
4. **Check system limits**:
```bash
# Linux
free -h
ulimit -a
```
### Slow Performance
**Symptoms**: Very slow responses, low tokens/second
**Solutions**:
1. **Check hardware**:
- Use SSD instead of HDD for model storage
- Ensure adequate CPU cores
- Enable GPU acceleration if available
2. **Optimize configuration**:
```yaml
threads: 4 # Match CPU cores
gpu_layers: 35 # Offload to GPU if available
mmap: true # Enable memory mapping
```
3. **Check for bottlenecks**:
```bash
# Monitor CPU
top
# Monitor GPU (NVIDIA)
nvidia-smi
# Monitor disk I/O
iostat
```
4. **Disable unnecessary features**:
- Set `mirostat: 0` if not needed
- Reduce context size
- Use smaller models
5. **Check network**: If using remote models, check network latency
### GPU Not Working
**Symptoms**: GPU not detected, no GPU usage, or CUDA errors
**Solutions**:
1. **Verify GPU drivers**:
```bash
# NVIDIA
nvidia-smi
# AMD
rocm-smi
```
2. **Check Docker GPU access**:
```bash
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
```
3. **Use correct image**:
- NVIDIA: `localai/localai:latest-gpu-nvidia-cuda-12`
- AMD: `localai/localai:latest-gpu-hipblas`
- Intel: `localai/localai:latest-gpu-intel`
4. **Configure GPU layers**:
```yaml
gpu_layers: 35 # Adjust based on GPU memory
f16: true
```
5. **Check CUDA version**: Ensure CUDA version matches (11.7 vs 12.0)
6. **Check logs**: Enable debug mode to see GPU initialization messages
### API Errors
**Symptoms**: 400, 404, 500, or 503 errors from API
**Solutions**:
1. **404 - Model Not Found**:
- Verify model name is correct
- Check model is installed: `curl http://localhost:8080/v1/models`
- Ensure model file exists in models directory
2. **503 - Service Unavailable**:
- Model may not be loaded yet (wait a moment)
- Check if model failed to load (check logs)
- Verify backend is installed
3. **400 - Bad Request**:
- Check request format matches API specification
- Verify all required parameters are present
- Check parameter types and values
4. **500 - Internal Server Error**:
- Enable debug mode: `DEBUG=true`
- Check logs for specific error
- Verify model configuration is valid
5. **401 - Unauthorized**:
- Check if API key is required
- Verify API key is correct
- Include Authorization header if needed
### Installation Issues
**Symptoms**: Installation fails or LocalAI won't start
**Solutions**:
1. **Docker issues**:
```bash
# Check Docker is running
docker ps
# Check image exists
docker images | grep localai
# Pull latest image
docker pull localai/localai:latest
```
2. **Permission issues**:
```bash
# Check file permissions
ls -la models/
# Fix permissions if needed
chmod -R 755 models/
```
3. **Port already in use**:
```bash
# Find process using port
lsof -i :8080
# Use different port
docker run -p 8081:8080 ...
```
4. **Binary not found**:
- Verify binary is in PATH
- Check binary has execute permissions
- Reinstall if needed
### Backend Issues
**Symptoms**: Backend fails to install or load
**Solutions**:
1. **Check backend availability**:
```bash
local-ai backends list
```
2. **Manual installation**:
```bash
local-ai backends install <backend-name>
```
3. **Check network**: Backend download requires internet connection
4. **Check disk space**: Ensure sufficient space for backend files
5. **Rebuild if needed**:
```bash
REBUILD=true local-ai
```
### Configuration Issues
**Symptoms**: Models not working as expected, wrong behavior
**Solutions**:
1. **Validate YAML syntax**:
```bash
# Check YAML is valid
yamllint model.yaml
```
2. **Check configuration reference**:
- See [Model Configuration]({{% relref "docs/advanced/model-configuration" %}})
- Verify all parameters are correct
3. **Test with minimal config**:
- Start with basic configuration
- Add parameters one at a time
4. **Check template files**:
- Verify template syntax
- Check template matches model type
## Debugging Tips
### Enable Debug Mode
```bash
# Environment variable
DEBUG=true local-ai
# Command line flag
local-ai --debug
# Docker
docker run -e DEBUG=true ...
```
### Check Logs
```bash
# Docker logs
docker logs local-ai
# Systemd logs
journalctl -u localai -f
# Direct output
local-ai 2>&1 | tee localai.log
```
### Test API Endpoints
```bash
# Health check
curl http://localhost:8080/healthz
# Readiness check
curl http://localhost:8080/readyz
# List models
curl http://localhost:8080/v1/models
# Test chat
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "gpt-4", "messages": [{"role": "user", "content": "test"}]}'
```
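The same checks can be bundled into a small Python smoke test (the `gpt-4` model name mirrors the example above):
```python
# Quick smoke test of the main LocalAI endpoints.
import requests

BASE = "http://localhost:8080"

def check(name, method, path, **kwargs):
    try:
        resp = requests.request(method, f"{BASE}{path}", timeout=30, **kwargs)
        print(f"{name}: HTTP {resp.status_code}")
    except requests.exceptions.RequestException as exc:
        print(f"{name}: FAILED ({exc})")

check("health", "GET", "/healthz")
check("ready", "GET", "/readyz")
check("models", "GET", "/v1/models")
check("chat", "POST", "/v1/chat/completions",
      json={"model": "gpt-4",
            "messages": [{"role": "user", "content": "test"}]})
```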
### Monitor Resources
```bash
# CPU and memory
htop
# GPU (NVIDIA)
watch -n 1 nvidia-smi
# Disk usage
df -h
du -sh models/
# Network
iftop
```
## Performance Issues
### Slow Inference
1. **Check token speed**: Look for tokens/second in debug logs
2. **Optimize threads**: Match CPU cores
3. **Enable GPU**: Use GPU acceleration
4. **Reduce context**: Smaller context = faster inference
5. **Use quantization**: Q4_K_M is good balance
### High Memory Usage
1. **Use smaller models**: 1-3B instead of 7B+
2. **Lower quantization**: Q2_K uses less memory
3. **Reduce context size**: Smaller context = less memory
4. **Disable mmap**: Set `mmap: false` (slower but uses less memory)
5. **Unload unused models**: Only load models you're using
## Platform-Specific Issues
### macOS
- **Quarantine warnings**: See [FAQ]({{% relref "docs/faq" %}})
- **Metal not working**: Ensure Xcode is installed
- **Docker performance**: Consider building from source for better performance
### Linux
- **Permission denied**: Check file permissions and SELinux
- **Missing libraries**: Install required system libraries
- **Systemd issues**: Check service status and logs
### Windows/WSL
- **Slow model loading**: Ensure files are on Linux filesystem
- **GPU access**: May require WSL2 with GPU support
- **Path issues**: Use forward slashes in paths
## Getting More Help
If you've tried the solutions above and still have issues:
1. **Check GitHub Issues**: Search [GitHub Issues](https://github.com/mudler/LocalAI/issues)
2. **Ask on Discord**: Join [Discord](https://discord.gg/uJAeKSAGDy)
3. **Create an Issue**: Provide all debugging information
4. **Check Documentation**: Review relevant documentation sections
## See Also
- [FAQ]({{% relref "docs/faq" %}}) - Common questions
- [Performance Tuning]({{% relref "docs/advanced/performance-tuning" %}}) - Optimize performance
- [VRAM Management]({{% relref "docs/advanced/vram-management" %}}) - GPU memory management
- [Model Configuration]({{% relref "docs/advanced/model-configuration" %}}) - Configuration reference

View File

@@ -0,0 +1,34 @@
+++
disableToc = false
title = "Tutorials"
weight = 5
icon = "school"
description = "Step-by-step guides to help you get started with LocalAI"
+++
Welcome to the LocalAI tutorials section! These step-by-step guides will help you learn how to use LocalAI effectively, from your first chat to deploying in production.
## Getting Started Tutorials
Start here if you're new to LocalAI:
1. **[Your First Chat]({{% relref "docs/tutorials/first-chat" %}})** - Learn how to install LocalAI and have your first conversation with an AI model
2. **[Setting Up Models]({{% relref "docs/tutorials/setting-up-models" %}})** - A comprehensive guide to installing and configuring models
3. **[Using GPU Acceleration]({{% relref "docs/tutorials/using-gpu" %}})** - Set up GPU support for faster inference
## Advanced Tutorials
Ready to take it further?
4. **[Deploying to Production]({{% relref "docs/tutorials/deploying-production" %}})** - Best practices for running LocalAI in production environments
5. **[Integration Examples]({{% relref "docs/tutorials/integration-examples" %}})** - Learn how to integrate LocalAI with popular frameworks and tools
## What's Next?
After completing the tutorials, explore:
- [Features Documentation]({{% relref "docs/features" %}}) - Detailed information about all LocalAI capabilities
- [Advanced Configuration]({{% relref "docs/advanced" %}}) - Fine-tune your setup
- [API Reference]({{% relref "docs/reference/api-reference" %}}) - Complete API documentation
- [Troubleshooting Guide]({{% relref "docs/troubleshooting" %}}) - Solutions to common problems

View File

@@ -0,0 +1,355 @@
+++
disableToc = false
title = "Deploying to Production"
weight = 4
icon = "rocket_launch"
description = "Best practices for running LocalAI in production environments"
+++
This tutorial covers best practices for deploying LocalAI in production environments, including security, performance, monitoring, and reliability considerations.
## Prerequisites
- LocalAI installed and tested
- Understanding of your deployment environment
- Basic knowledge of Docker, Kubernetes, or your chosen deployment method
## Security Considerations
### 1. API Key Protection
**Always use API keys in production**:
```bash
# Set API key
API_KEY=your-secure-random-key local-ai
# Or multiple keys
API_KEY=key1,key2,key3 local-ai
```
**Best Practices**:
- Use strong, randomly generated keys
- Store keys securely (environment variables, secrets management)
- Rotate keys regularly
- Use different keys for different services/clients
### 2. Network Security
**Never expose LocalAI directly to the internet** without protection:
- Use a reverse proxy (nginx, Traefik, Caddy)
- Enable HTTPS/TLS
- Use firewall rules to restrict access
- Consider VPN or private network access only
**Example nginx configuration**:
```nginx
server {
listen 443 ssl;
server_name localai.example.com;
ssl_certificate /path/to/cert.pem;
ssl_certificate_key /path/to/key.pem;
location / {
proxy_pass http://localhost:8080;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
}
```
### 3. Resource Limits
Set appropriate resource limits to prevent resource exhaustion:
```yaml
# Docker Compose example
services:
localai:
deploy:
resources:
limits:
cpus: '4'
memory: 16G
reservations:
cpus: '2'
memory: 8G
```
## Deployment Methods
### Docker Compose (Recommended for Small-Medium Deployments)
```yaml
version: '3.8'
services:
localai:
image: localai/localai:latest
ports:
- "8080:8080"
environment:
- API_KEY=${API_KEY}
- DEBUG=false
- MODELS_PATH=/models
volumes:
- ./models:/models
- ./config:/config
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
interval: 30s
timeout: 10s
retries: 3
deploy:
resources:
limits:
memory: 16G
```
### Kubernetes
See the [Kubernetes Deployment Guide]({{% relref "docs/getting-started/kubernetes" %}}) for detailed instructions.
**Key considerations**:
- Use ConfigMaps for configuration
- Use Secrets for API keys
- Set resource requests and limits
- Configure health checks and liveness probes
- Use PersistentVolumes for model storage
### Systemd Service (Linux)
Create a systemd service file:
```ini
[Unit]
Description=LocalAI Service
After=network.target
[Service]
Type=simple
User=localai
Environment="API_KEY=your-key"
Environment="MODELS_PATH=/var/lib/localai/models"
ExecStart=/usr/local/bin/local-ai
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
```
## Performance Optimization
### 1. Model Selection
- Use quantized models (Q4_K_M) for production
- Choose models appropriate for your hardware
- Consider model size vs. quality trade-offs
### 2. Resource Allocation
```yaml
# Model configuration
name: production-model
parameters:
model: model.gguf
context_size: 2048 # Adjust based on needs
threads: 4 # Match CPU cores
gpu_layers: 35 # If using GPU
```
### 3. Caching
Enable prompt caching for repeated queries:
```yaml
prompt_cache_path: "cache"
prompt_cache_all: true
```
### 4. Connection Pooling
If using a reverse proxy, configure connection pooling:
```nginx
upstream localai {
least_conn;
server localhost:8080 max_fails=3 fail_timeout=30s;
keepalive 32;
}
```
## Monitoring and Logging
### 1. Health Checks
LocalAI provides health check endpoints:
```bash
# Readiness check
curl http://localhost:8080/readyz
# Health check
curl http://localhost:8080/healthz
```
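In deployment scripts it is often useful to block until the readiness endpoint succeeds before routing traffic. A minimal polling sketch:
```python
# Block until LocalAI reports ready (useful in deployment scripts).
import time

import requests

def wait_until_ready(base_url="http://localhost:8080", timeout=300):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{base_url}/readyz", timeout=5).status_code == 200:
                return True
        except requests.exceptions.RequestException:
            pass  # not up yet
        time.sleep(5)
    return False

if __name__ == "__main__":
    print("ready" if wait_until_ready() else "timed out waiting for LocalAI")
```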
### 2. Logging
Configure appropriate log levels:
```bash
# Production: minimal logging
DEBUG=false local-ai
# Development: detailed logging
DEBUG=true local-ai
```
### 3. Metrics
Monitor key metrics:
- Request rate
- Response times
- Error rates
- Resource usage (CPU, memory, GPU)
- Model loading times
### 4. Alerting
Set up alerts for:
- Service downtime
- High error rates
- Resource exhaustion
- Slow response times
## High Availability
### 1. Multiple Instances
Run multiple LocalAI instances behind a load balancer:
```yaml
# Docker Compose with multiple instances
services:
localai1:
image: localai/localai:latest
# ... configuration
localai2:
image: localai/localai:latest
# ... configuration
nginx:
image: nginx:alpine
# Load balance between localai1 and localai2
```
### 2. Model Replication
Ensure models are available on all instances:
- Shared storage (NFS, S3, etc.)
- Model synchronization
- Consistent model versions
### 3. Graceful Shutdown
LocalAI supports graceful shutdown. Ensure your deployment method handles SIGTERM properly.
## Backup and Recovery
### 1. Model Backups
Regularly backup your models and configurations:
```bash
# Backup models
tar -czf models-backup-$(date +%Y%m%d).tar.gz models/
# Backup configurations
tar -czf config-backup-$(date +%Y%m%d).tar.gz config/
```
### 2. Configuration Management
Version control your configurations:
- Use Git for YAML configurations
- Document model versions
- Track configuration changes
### 3. Disaster Recovery
Plan for:
- Model storage recovery
- Configuration restoration
- Service restoration procedures
## Scaling Considerations
### Horizontal Scaling
- Run multiple instances
- Use load balancing
- Consider stateless design (shared model storage)
### Vertical Scaling
- Increase resources (CPU, RAM, GPU)
- Use more powerful hardware
- Optimize model configurations
## Maintenance
### 1. Updates
- Test updates in staging first
- Plan maintenance windows
- Have rollback procedures ready
### 2. Model Updates
- Test new models before production
- Keep model versions documented
- Have rollback capability
### 3. Monitoring
Regularly review:
- Performance metrics
- Error logs
- Resource usage trends
- User feedback
## Production Checklist
Before going live, ensure:
- [ ] API keys configured and secured
- [ ] HTTPS/TLS enabled
- [ ] Firewall rules configured
- [ ] Resource limits set
- [ ] Health checks configured
- [ ] Monitoring in place
- [ ] Logging configured
- [ ] Backups scheduled
- [ ] Documentation updated
- [ ] Team trained on operations
- [ ] Incident response plan ready
## What's Next?
- [Kubernetes Deployment]({{% relref "docs/getting-started/kubernetes" %}}) - Deploy on Kubernetes
- [Performance Tuning]({{% relref "docs/advanced/performance-tuning" %}}) - Optimize performance
- [Security Best Practices]({{% relref "docs/security" %}}) - Security guidelines
- [Troubleshooting Guide]({{% relref "docs/troubleshooting" %}}) - Production issues
## See Also
- [Container Images]({{% relref "docs/getting-started/container-images" %}})
- [Advanced Configuration]({{% relref "docs/advanced" %}})
- [FAQ]({{% relref "docs/faq" %}})

View File

@@ -0,0 +1,171 @@
+++
disableToc = false
title = "Your First Chat with LocalAI"
weight = 1
icon = "chat"
description = "Get LocalAI running and have your first conversation in minutes"
+++
This tutorial will guide you through installing LocalAI and having your first conversation with an AI model. By the end, you'll have LocalAI running and be able to chat with a local AI model.
## Prerequisites
- A computer running Linux, macOS, or Windows (with WSL)
- At least 4GB of RAM (8GB+ recommended)
- Docker installed (optional, but recommended for easiest setup)
## Step 1: Install LocalAI
Choose the installation method that works best for you:
### Option A: Docker (Recommended for Beginners)
```bash
# Run LocalAI with AIO (All-in-One) image - includes pre-configured models
docker run -p 8080:8080 --name local-ai -ti localai/localai:latest-aio-cpu
```
This will:
- Download the LocalAI image
- Start the API server on port 8080
- Automatically download and configure models
### Option B: Quick Install Script (Linux)
```bash
curl https://localai.io/install.sh | sh
```
### Option C: macOS DMG
Download the DMG from [GitHub Releases](https://github.com/mudler/LocalAI/releases/latest/download/LocalAI.dmg) and install it.
For more installation options, see the [Quickstart Guide]({{% relref "docs/getting-started/quickstart" %}}).
## Step 2: Verify Installation
Once LocalAI is running, verify it's working:
```bash
# Check if the API is responding
curl http://localhost:8080/v1/models
```
You should see a JSON response listing available models. If using the AIO image, you'll see models like `gpt-4`, `gpt-4-vision-preview`, etc.
## Step 3: Access the WebUI
Open your web browser and navigate to:
```
http://localhost:8080
```
You'll see the LocalAI WebUI with:
- A chat interface
- Model gallery
- Backend management
- Configuration options
## Step 4: Your First Chat
### Using the WebUI
1. In the WebUI, you'll see a chat interface
2. Select a model from the dropdown (if multiple models are available)
3. Type your message and press Enter
4. Wait for the AI to respond!
### Using the API (Command Line)
You can also chat using curl:
```bash
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [
{"role": "user", "content": "Hello! Can you introduce yourself?"}
],
"temperature": 0.7
}'
```
### Using Python
```python
import requests
response = requests.post(
"http://localhost:8080/v1/chat/completions",
json={
"model": "gpt-4",
"messages": [
{"role": "user", "content": "Hello! Can you introduce yourself?"}
],
"temperature": 0.7
}
)
print(response.json()["choices"][0]["message"]["content"])
```
## Step 5: Try Different Models
If you're using the AIO image, you have several models pre-installed:
- `gpt-4` - Text generation
- `gpt-4-vision-preview` - Vision and text
- `tts-1` - Text to speech
- `whisper-1` - Speech to text
Try asking the vision model about an image, or generate speech with the TTS model!
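For example, assuming the OpenAI-compatible audio endpoints are available (as they are with the AIO image), you can generate speech and transcribe it back from the command line:
```bash
# Generate speech with the pre-installed TTS model
# (the exact output format depends on the configured TTS backend)
curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Hello from LocalAI!"}' \
  --output speech.wav

# Transcribe audio back to text with the Whisper model
curl http://localhost:8080/v1/audio/transcriptions \
  -F file=@speech.wav \
  -F model=whisper-1
```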
### Installing New Models via WebUI
To install additional models, you can use the WebUI's import interface:
1. In the WebUI, navigate to the "Models" tab
2. Click "Import Model" or "New Model"
3. Enter a model URI (e.g., `huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf`)
4. Configure preferences or use Advanced Mode for YAML editing
5. Click "Import Model" to start the installation
For more details, see [Setting Up Models]({{% relref "docs/tutorials/setting-up-models" %}}).
## Troubleshooting
### Port 8080 is already in use
Change the port mapping:
```bash
docker run -p 8081:8080 --name local-ai -ti localai/localai:latest-aio-cpu
```
Then access at `http://localhost:8081`
### No models available
If you're using a standard (non-AIO) image, you need to install models. See [Setting Up Models]({{% relref "docs/tutorials/setting-up-models" %}}) tutorial.
### Slow responses
- Check if you have enough RAM
- Consider using a smaller model
- Enable GPU acceleration (see [Using GPU]({{% relref "docs/tutorials/using-gpu" %}}))
## What's Next?
Congratulations! You've successfully set up LocalAI and had your first chat. Here's what to explore next:
1. **[Setting Up Models]({{% relref "docs/tutorials/setting-up-models" %}})** - Learn how to install and configure different models
2. **[Using GPU Acceleration]({{% relref "docs/tutorials/using-gpu" %}})** - Speed up inference with GPU support
3. **[Try It Out]({{% relref "docs/getting-started/try-it-out" %}})** - Explore more API endpoints and features
4. **[Features Documentation]({{% relref "docs/features" %}})** - Discover all LocalAI capabilities
## See Also
- [Quickstart Guide]({{% relref "docs/getting-started/quickstart" %}})
- [FAQ]({{% relref "docs/faq" %}})
- [Troubleshooting Guide]({{% relref "docs/troubleshooting" %}})

View File

@@ -0,0 +1,361 @@
+++
disableToc = false
title = "Integration Examples"
weight = 5
icon = "sync"
description = "Learn how to integrate LocalAI with popular frameworks and tools"
+++
This tutorial shows you how to integrate LocalAI with popular AI frameworks and tools. LocalAI's OpenAI-compatible API makes it easy to use as a drop-in replacement.
## Prerequisites
- LocalAI running and accessible
- Basic knowledge of the framework you want to integrate
- Python, Node.js, or other runtime as needed
## Python Integrations
### LangChain
LangChain has built-in support for LocalAI:
```python
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
# For chat models
llm = ChatOpenAI(
openai_api_key="not-needed",
openai_api_base="http://localhost:8080/v1",
model_name="gpt-4"
)
response = llm.predict("Hello, how are you?")
print(response)
```
### OpenAI Python SDK
The official OpenAI Python SDK works directly with LocalAI:
```python
import openai
openai.api_base = "http://localhost:8080/v1"
openai.api_key = "not-needed"
response = openai.ChatCompletion.create(
model="gpt-4",
messages=[
{"role": "user", "content": "Hello!"}
]
)
print(response.choices[0].message.content)
```
### LangChain with LocalAI Functions
```python
from langchain.agents import initialize_agent, Tool
from langchain.llms import OpenAI
llm = OpenAI(
openai_api_key="not-needed",
openai_api_base="http://localhost:8080/v1"
)
tools = [
Tool(
name="Calculator",
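        # Note: eval() is unsafe on untrusted input; it is used here only to keep the example short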
func=lambda x: eval(x),
description="Useful for mathematical calculations"
)
]
agent = initialize_agent(tools, llm, agent="zero-shot-react-description")
result = agent.run("What is 25 * 4?")
```
## JavaScript/TypeScript Integrations
### OpenAI Node.js SDK
```javascript
import OpenAI from 'openai';
const openai = new OpenAI({
baseURL: 'http://localhost:8080/v1',
apiKey: 'not-needed',
});
async function main() {
const completion = await openai.chat.completions.create({
model: 'gpt-4',
messages: [{ role: 'user', content: 'Hello!' }],
});
console.log(completion.choices[0].message.content);
}
main();
```
### LangChain.js
```javascript
import { ChatOpenAI } from "langchain/chat_models/openai";
const model = new ChatOpenAI({
openAIApiKey: "not-needed",
configuration: {
baseURL: "http://localhost:8080/v1",
},
modelName: "gpt-4",
});
const response = await model.invoke("Hello, how are you?");
console.log(response.content);
```
## Integration with Specific Tools
### AutoGPT
AutoGPT can use LocalAI by setting the API base URL:
```bash
export OPENAI_API_BASE=http://localhost:8080/v1
export OPENAI_API_KEY=not-needed
```
Then run AutoGPT normally.
### Flowise
Flowise supports LocalAI out of the box. In the Flowise UI:
1. Add a ChatOpenAI node
2. Set the base URL to `http://localhost:8080/v1`
3. Set API key to any value (or leave empty)
4. Select your model
### Continue (VS Code Extension)
Configure Continue to use LocalAI:
```json
{
"models": [
{
"title": "LocalAI",
"provider": "openai",
"model": "gpt-4",
"apiBase": "http://localhost:8080/v1",
"apiKey": "not-needed"
}
]
}
```
### AnythingLLM
AnythingLLM has native LocalAI support:
1. Go to Settings > LLM Preference
2. Select "LocalAI"
3. Enter your LocalAI endpoint: `http://localhost:8080`
4. Select your model
## REST API Examples
### cURL
```bash
# Chat completion
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# List models
curl http://localhost:8080/v1/models
# Embeddings
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "text-embedding-ada-002",
"input": "Hello world"
}'
```
### Python Requests
```python
import requests
response = requests.post(
"http://localhost:8080/v1/chat/completions",
json={
"model": "gpt-4",
"messages": [{"role": "user", "content": "Hello!"}]
}
)
print(response.json())
```
## Advanced Integrations
### Custom Wrapper
Create a custom wrapper for your application:
```python
class LocalAIClient:
def __init__(self, base_url="http://localhost:8080/v1"):
self.base_url = base_url
self.api_key = "not-needed"
def chat(self, messages, model="gpt-4", **kwargs):
response = requests.post(
f"{self.base_url}/chat/completions",
json={
"model": model,
"messages": messages,
**kwargs
},
headers={"Authorization": f"Bearer {self.api_key}"}
)
return response.json()
def embeddings(self, text, model="text-embedding-ada-002"):
response = requests.post(
f"{self.base_url}/embeddings",
json={
"model": model,
"input": text
}
)
return response.json()
```
### Streaming Responses
```python
import requests
import json
def stream_chat(messages, model="gpt-4"):
response = requests.post(
"http://localhost:8080/v1/chat/completions",
json={
"model": model,
"messages": messages,
"stream": True
},
stream=True
)
    for line in response.iter_lines():
        if not line:
            continue
        payload = line.decode('utf-8').replace('data: ', '', 1).strip()
        if payload == '[DONE]':
            break
        data = json.loads(payload)
        if 'choices' in data:
            content = data['choices'][0].get('delta', {}).get('content', '')
            if content:
                yield content
```
## Common Integration Patterns
### Error Handling
```python
import time

import requests
from requests.exceptions import RequestException
def safe_chat_request(messages, model="gpt-4", retries=3):
for attempt in range(retries):
try:
response = requests.post(
"http://localhost:8080/v1/chat/completions",
json={"model": model, "messages": messages},
timeout=30
)
response.raise_for_status()
return response.json()
except RequestException as e:
if attempt == retries - 1:
raise
time.sleep(2 ** attempt) # Exponential backoff
```
### Rate Limiting
```python
from functools import wraps
import time
def rate_limit(calls_per_second=2):
min_interval = 1.0 / calls_per_second
last_called = [0.0]
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
elapsed = time.time() - last_called[0]
left_to_wait = min_interval - elapsed
if left_to_wait > 0:
time.sleep(left_to_wait)
ret = func(*args, **kwargs)
last_called[0] = time.time()
return ret
return wrapper
return decorator
@rate_limit(calls_per_second=2)
def chat_request(messages):
# Your chat request here
pass
```
## Testing Integrations
### Unit Tests
```python
import unittest
from unittest.mock import patch, Mock
import requests
class TestLocalAIIntegration(unittest.TestCase):
@patch('requests.post')
def test_chat_completion(self, mock_post):
mock_response = Mock()
mock_response.json.return_value = {
"choices": [{
"message": {"content": "Hello!"}
}]
}
        mock_post.return_value = mock_response

        # Exercise the integration code path and verify the parsed reply
        response = requests.post(
            "http://localhost:8080/v1/chat/completions",
            json={"model": "gpt-4", "messages": [{"role": "user", "content": "Hi"}]}
        )
        content = response.json()["choices"][0]["message"]["content"]
        self.assertEqual(content, "Hello!")
        mock_post.assert_called_once()
```
## What's Next?
- [API Reference]({{% relref "docs/reference/api-reference" %}}) - Complete API documentation
- [Integrations]({{% relref "docs/integrations" %}}) - List of compatible projects
- [Examples Repository](https://github.com/mudler/LocalAI-examples) - More integration examples
## See Also
- [Features Documentation]({{% relref "docs/features" %}}) - All LocalAI capabilities
- [FAQ]({{% relref "docs/faq" %}}) - Common integration questions
- [Troubleshooting]({{% relref "docs/troubleshooting" %}}) - Integration issues

View File

@@ -0,0 +1,267 @@
+++
disableToc = false
title = "Setting Up Models"
weight = 2
icon = "hub"
description = "Learn how to install, configure, and manage models in LocalAI"
+++
This tutorial covers everything you need to know about installing and configuring models in LocalAI. You'll learn multiple methods to get models running.
## Prerequisites
- LocalAI installed and running (see [Your First Chat]({{% relref "docs/tutorials/first-chat" %}}) if you haven't set it up yet)
- Basic understanding of command line usage
## Method 1: Using the Model Gallery (Easiest)
The Model Gallery is the simplest way to install models. It provides pre-configured models ready to use.
### Via WebUI
1. Open the LocalAI WebUI at `http://localhost:8080`
2. Navigate to the "Models" tab
3. Browse available models
4. Click "Install" on any model you want
5. Wait for installation to complete
## Method 1.5: Import Models via WebUI
The WebUI provides a powerful model import interface that supports both simple and advanced configuration:
### Simple Import Mode
1. Open the LocalAI WebUI at `http://localhost:8080`
2. Click "Import Model"
3. Enter the model URI (e.g., `https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct-GGUF`)
4. Optionally configure preferences:
- Backend selection
- Model name
- Description
- Quantizations
- Embeddings support
- Custom preferences
5. Click "Import Model" to start the import process
### Advanced Import Mode
For full control over model configuration:
1. In the WebUI, click "Import Model"
2. Toggle to "Advanced Mode"
3. Edit the YAML configuration directly in the code editor
4. Use the "Validate" button to check your configuration
5. Click "Create" or "Update" to save
The advanced editor includes:
- Syntax highlighting
- YAML validation
- Format and copy tools
- Full configuration options
This is especially useful for:
- Custom model configurations
- Fine-tuning model parameters
- Setting up multi-model or otherwise complex deployments
- Editing existing model configurations
### Via CLI
```bash
# List available models
local-ai models list
# Install a specific model
local-ai models install llama-3.2-1b-instruct:q4_k_m
# Start LocalAI with a model from the gallery
local-ai run llama-3.2-1b-instruct:q4_k_m
```
### Browse Online
Visit [models.localai.io](https://models.localai.io) to browse all available models in your browser.
## Method 2: Installing from Hugging Face
LocalAI can directly install models from Hugging Face:
```bash
# Install and run a model from Hugging Face
local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
```
The format is: `huggingface://<repository>/<model-file>`
## Method 3: Installing from OCI Registries
### Ollama Registry
```bash
local-ai run ollama://gemma:2b
```
### Standard OCI Registry
```bash
local-ai run oci://localai/phi-2:latest
```
## Method 4: Manual Installation
For full control, you can manually download and configure models.
### Step 1: Download a Model
Download a GGUF model file. Popular sources:
- [Hugging Face](https://huggingface.co/models?search=gguf)
Example:
```bash
mkdir -p models
wget https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
-O models/phi-2.Q4_K_M.gguf
```
### Step 2: Create a Configuration File (Optional)
Create a YAML file to configure the model:
```yaml
# models/phi-2.yaml
name: phi-2
parameters:
model: phi-2.Q4_K_M.gguf
temperature: 0.7
context_size: 2048
threads: 4
backend: llama-cpp
```
### Step 3: Start LocalAI
```bash
# With Docker
docker run -p 8080:8080 -v $PWD/models:/models \
localai/localai:latest
# Or with binary
local-ai --models-path ./models
```
## Understanding Model Files
### File Formats
- **GGUF**: Modern format, recommended for most use cases
- **GGML**: Older format, still supported but deprecated
### Quantization Levels
Models come in different quantization levels (quality vs. size trade-off):
| Quantization | Size | Quality | Use Case |
|-------------|------|---------|----------|
| Q8_0 | Largest | Highest | Best quality, requires more RAM |
| Q6_K | Large | Very High | High quality |
| Q4_K_M | Medium | High | Balanced (recommended) |
| Q4_K_S | Small | Medium | Lower RAM usage |
| Q2_K | Smallest | Lower | Minimal RAM, lower quality |
### Choosing the Right Model
Consider the following (a quick resource check is sketched after this list):
- **RAM available**: Larger models need more RAM
- **Use case**: Different models excel at different tasks
- **Speed**: Smaller quantizations are faster
- **Quality**: Higher quantizations produce better output
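As a rough rule of thumb, the model file should comfortably fit in your available RAM, with extra headroom for the context. A quick check on Linux:
```bash
# Compare free system memory with the size of the model files on disk
free -h                  # available RAM
ls -lh models/*.gguf     # model file sizes
```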
## Model Configuration
### Basic Configuration
Create a YAML file in your models directory:
```yaml
name: my-model
parameters:
model: model.gguf
temperature: 0.7
top_p: 0.9
context_size: 2048
threads: 4
backend: llama-cpp
```
### Advanced Configuration
See the [Model Configuration]({{% relref "docs/advanced/model-configuration" %}}) guide for all available options.
## Managing Models
### List Installed Models
```bash
# Via API
curl http://localhost:8080/v1/models
# Via CLI
local-ai models list
```
### Remove Models
Simply delete the model file and configuration from your models directory:
```bash
rm models/model-name.gguf
rm models/model-name.yaml # if exists
```
## Troubleshooting
### Model Not Loading
1. **Check backend**: Ensure the required backend is installed
```bash
local-ai backends list
local-ai backends install llama-cpp # if needed
```
2. **Check logs**: Enable debug mode
```bash
DEBUG=true local-ai
```
3. **Verify file**: Ensure the model file is not corrupted
### Out of Memory
- Use a smaller quantization (Q4_K_S or Q2_K)
- Reduce `context_size` in configuration
- Close other applications to free RAM
### Wrong Backend
Check the [Compatibility Table]({{% relref "docs/reference/compatibility-table" %}}) to ensure you're using the correct backend for your model.
## Best Practices
1. **Start small**: Begin with smaller models to test your setup
2. **Use quantized models**: Q4_K_M is a good balance for most use cases
3. **Organize models**: Keep your models directory organized
4. **Backup configurations**: Save your YAML configurations
5. **Monitor resources**: Watch RAM and disk usage
## What's Next?
- [Using GPU Acceleration]({{% relref "docs/tutorials/using-gpu" %}}) - Speed up inference
- [Model Configuration]({{% relref "docs/advanced/model-configuration" %}}) - Advanced configuration options
- [Compatibility Table]({{% relref "docs/reference/compatibility-table" %}}) - Find compatible models and backends
## See Also
- [Model Gallery Documentation]({{% relref "docs/features/model-gallery" %}})
- [Install and Run Models]({{% relref "docs/getting-started/models" %}})
- [FAQ]({{% relref "docs/faq" %}})

View File

@@ -0,0 +1,254 @@
+++
disableToc = false
title = "Using GPU Acceleration"
weight = 3
icon = "memory"
description = "Set up GPU acceleration for faster inference"
+++
This tutorial will guide you through setting up GPU acceleration for LocalAI. GPU acceleration can significantly speed up model inference, especially for larger models.
## Prerequisites
- A compatible GPU (NVIDIA, AMD, Intel, or Apple Silicon)
- LocalAI installed
- Basic understanding of your system's GPU setup
## Check Your GPU
First, verify you have a compatible GPU:
### NVIDIA
```bash
nvidia-smi
```
You should see your GPU information. Ensure you have CUDA 11.7 or 12.0+ installed.
### AMD
```bash
rocminfo
```
### Intel
```bash
intel_gpu_top # if available
```
### Apple Silicon (macOS)
Apple Silicon (M1/M2/M3) GPUs are automatically detected. No additional setup needed!
## Installation Methods
### Method 1: Docker with GPU Support (Recommended)
#### NVIDIA CUDA
```bash
# CUDA 12.0
docker run -p 8080:8080 --gpus all --name local-ai \
-ti localai/localai:latest-gpu-nvidia-cuda-12
# CUDA 11.7
docker run -p 8080:8080 --gpus all --name local-ai \
-ti localai/localai:latest-gpu-nvidia-cuda-11
```
**Prerequisites**: Install [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
#### AMD ROCm
```bash
docker run -p 8080:8080 \
--device=/dev/kfd \
--device=/dev/dri \
--group-add=video \
--name local-ai \
-ti localai/localai:latest-gpu-hipblas
```
#### Intel GPU
```bash
docker run -p 8080:8080 --name local-ai \
-ti localai/localai:latest-gpu-intel
```
#### Apple Silicon
GPU acceleration works automatically when running on macOS with Apple Silicon. Use the standard CPU image - Metal acceleration is built-in.
### Method 2: AIO Images with GPU
AIO images are also available with GPU support:
```bash
# NVIDIA CUDA 12
docker run -p 8080:8080 --gpus all --name local-ai \
-ti localai/localai:latest-aio-gpu-nvidia-cuda-12
# AMD
docker run -p 8080:8080 \
--device=/dev/kfd --device=/dev/dri --group-add=video \
--name local-ai \
-ti localai/localai:latest-aio-gpu-hipblas
```
### Method 3: Build from Source
For building with GPU support from source, see the [Build Guide]({{% relref "docs/getting-started/build" %}}).
## Configuring Models for GPU
### Automatic Detection
LocalAI automatically detects GPU capabilities and downloads the appropriate backend when you install models from the gallery.
### Manual Configuration
In your model YAML configuration, specify GPU layers:
```yaml
name: my-model
parameters:
model: model.gguf
backend: llama-cpp
# Offload layers to GPU (adjust based on your GPU memory)
f16: true
gpu_layers: 35 # Number of layers to offload to GPU
```
**GPU Layers Guidelines**:
- **Small GPU (4-6GB)**: 20-30 layers
- **Medium GPU (8-12GB)**: 30-40 layers
- **Large GPU (16GB+)**: 40+ layers or set to model's total layer count
### Finding the Right Number of Layers
1. Start with a conservative number (e.g., 20)
2. Monitor GPU memory usage with `nvidia-smi` (NVIDIA) or `rocm-smi` (AMD); a query sketch follows this list
3. Gradually increase until you reach GPU memory limits
4. For maximum performance, offload all layers if you have enough VRAM
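For example, you can poll VRAM usage from the command line while experimenting with `gpu_layers` (the AMD flags may vary slightly between ROCm versions):
```bash
# NVIDIA: report used vs. total VRAM
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# AMD: show VRAM usage
rocm-smi --showmeminfo vram
```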
## Verifying GPU Usage
### Check if GPU is Being Used
#### NVIDIA
```bash
# Watch GPU usage in real-time
watch -n 1 nvidia-smi
```
You should see:
- GPU utilization > 0%
- Memory usage increasing
- Processes running on GPU
#### AMD
```bash
rocm-smi
```
#### Check Logs
Enable debug mode to see GPU information in logs:
```bash
DEBUG=true local-ai
```
Look for messages indicating GPU initialization and layer offloading.
## Performance Tips
### 1. Optimize GPU Layers
- Offload as many layers as your GPU memory allows
- Balance between GPU and CPU layers for best performance
- Use `f16: true` for better GPU performance
### 2. Batch Processing
GPU excels at batch processing. Process multiple requests together when possible.
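For example, a simple way to exercise this from the shell is to send several requests in parallel; the model name matches the example configuration above.
```bash
# Fire a few requests concurrently so the GPU stays busy
for i in 1 2 3 4; do
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "my-model", "messages": [{"role": "user", "content": "Request '"$i"'"}]}' &
done
wait
```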
### 3. Model Quantization
Even with GPU, quantized models (Q4_K_M) often provide the best speed/quality balance.
### 4. Context Size
Larger context sizes use more GPU memory. Adjust based on your GPU:
```yaml
context_size: 4096 # Adjust based on GPU memory
```
## Troubleshooting
### GPU Not Detected
1. **Check drivers**: Ensure GPU drivers are installed
2. **Check Docker**: Verify Docker has GPU access
```bash
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
```
3. **Check logs**: Enable debug mode and check for GPU-related errors
### Out of GPU Memory
- Reduce `gpu_layers` in model configuration
- Use a smaller model or lower quantization
- Reduce `context_size`
- Close other GPU-using applications
### Slow Performance
- Ensure you're using the correct GPU image
- Check that layers are actually offloaded (check logs)
- Verify GPU drivers are up to date
- Consider using a more powerful GPU or reducing model size
### CUDA Errors
- Ensure CUDA version matches (11.7 vs 12.0)
- Check CUDA compatibility with your GPU
- Try rebuilding with `REBUILD=true`
## Platform-Specific Notes
### NVIDIA Jetson (L4T)
Use the L4T-specific images:
```bash
docker run -p 8080:8080 --runtime nvidia --gpus all \
--name local-ai \
-ti localai/localai:latest-nvidia-l4t-arm64
```
### Apple Silicon
- Metal acceleration is automatic
- No special Docker flags needed
- Use standard CPU images - Metal is built-in
- For best performance, build from source on macOS
## What's Next?
- [GPU Acceleration Documentation]({{% relref "docs/features/gpu-acceleration" %}}) - Detailed GPU information
- [Performance Tuning]({{% relref "docs/advanced/performance-tuning" %}}) - Optimize your setup
- [VRAM Management]({{% relref "docs/advanced/vram-management" %}}) - Manage GPU memory efficiently
## See Also
- [Compatibility Table]({{% relref "docs/reference/compatibility-table" %}}) - GPU support by backend
- [Build Guide]({{% relref "docs/getting-started/build" %}}) - Build with GPU support
- [FAQ]({{% relref "docs/faq" %}}) - Common GPU questions

View File

@@ -1,16 +1,60 @@
+++
disableToc = false
title = "News"
title = "What's New"
weight = 7
url = '/basics/news/'
icon = "newspaper"
+++
Release notes have been moved to GitHub releases for the most up-to-date information.
You can see all release notes [here](https://github.com/mudler/LocalAI/releases).
## Recent Highlights
### 2025
**July 2025**: All backends have been moved out of the main binary. LocalAI is now smaller and more lightweight, and it automatically downloads the backend required to run a model. [Read the release notes](https://github.com/mudler/LocalAI/releases/tag/v3.2.0)
**June 2025**: [Backend management](https://github.com/mudler/LocalAI/pull/5607) has been added. Note that extras images will be deprecated starting with the next release; see [the backend management PR](https://github.com/mudler/LocalAI/pull/5607) for details.
**May 2025**: [Audio input](https://github.com/mudler/LocalAI/pull/5466) and [Reranking](https://github.com/mudler/LocalAI/pull/5396) in llama.cpp backend, [Realtime API](https://github.com/mudler/LocalAI/pull/5392), Support to Gemma, SmollVLM, and more multimodal models (available in the gallery).
**May 2025**: Important: image name changes [See release](https://github.com/mudler/LocalAI/releases/tag/v2.29.0)
**April 2025**: Rebrand, WebUI enhancements
**April 2025**: [LocalAGI](https://github.com/mudler/LocalAGI) and [LocalRecall](https://github.com/mudler/LocalRecall) join the LocalAI family stack.
**April 2025**: WebUI overhaul, AIO images updates
**February 2025**: Backend cleanup, Breaking changes, new backends (kokoro, OutelTTS, faster-whisper), Nvidia L4T images
**January 2025**: LocalAI model release: https://huggingface.co/mudler/LocalAI-functioncall-phi-4-v0.3, SANA support in diffusers: https://github.com/mudler/LocalAI/pull/4603
### 2024
**December 2024**: stablediffusion.cpp backend (ggml) added ( https://github.com/mudler/LocalAI/pull/4289 )
**November 2024**: Bark.cpp backend added ( https://github.com/mudler/LocalAI/pull/4287 )
**November 2024**: Voice activity detection models (**VAD**) added to the API: https://github.com/mudler/LocalAI/pull/4204
**October 2024**: Examples moved to [LocalAI-examples](https://github.com/mudler/LocalAI-examples)
**August 2024**: 🆕 FLUX-1, [P2P Explorer](https://explorer.localai.io)
**July 2024**: 🔥🔥 🆕 P2P Dashboard, LocalAI Federated mode and AI Swarms: https://github.com/mudler/LocalAI/pull/2723. P2P Global community pools: https://github.com/mudler/LocalAI/issues/3113
**May 2024**: 🔥🔥 Decentralized P2P llama.cpp: https://github.com/mudler/LocalAI/pull/2343 (peer2peer llama.cpp!) 👉 Docs https://localai.io/features/distribute/
**May 2024**: 🔥🔥 Distributed inferencing: https://github.com/mudler/LocalAI/pull/2324
**April 2024**: Reranker API: https://github.com/mudler/LocalAI/pull/2121
---
## Archive: Older Release Notes (2023 and earlier)
## 04-12-2023: __v2.0.0__
@@ -58,7 +102,7 @@ Thanks to @jespino now the local-ai binary has more subcommands allowing to mana
This is an exciting LocalAI release! Besides bug-fixes and enhancements this release brings the new backend to a whole new level by extending support to vllm and vall-e-x for audio generation!
Check out the documentation for vllm [here]({{% relref "docs/reference/compatibility-table" %}}) and Vall-E-X [here]({{% relref "docs/reference/compatibility-table" %}})
[Release notes](https://github.com/mudler/LocalAI/releases/tag/v1.30.0)