Compare commits: fix/mcp ... docs/impro (1 commit, SHA 3e8a54f4b6)
@@ -1,11 +1,38 @@
|
||||
---
|
||||
weight: 20
|
||||
title: "Advanced"
|
||||
description: "Advanced usage"
|
||||
icon: settings
|
||||
lead: ""
|
||||
date: 2020-10-06T08:49:15+00:00
|
||||
lastmod: 2020-10-06T08:49:15+00:00
|
||||
draft: false
|
||||
images: []
|
||||
---
|
||||
+++
|
||||
disableToc = false
|
||||
title = "Advanced Configuration"
|
||||
weight = 20
|
||||
icon = "settings"
|
||||
description = "Advanced configuration and optimization for LocalAI"
|
||||
+++
|
||||
|
||||
This section covers advanced configuration, optimization, and fine-tuning options for LocalAI.
|
||||
|
||||
## Configuration
|
||||
|
||||
- **[Model Configuration]({{% relref "docs/advanced/model-configuration" %}})** - Complete model configuration reference
|
||||
- **[Advanced Usage]({{% relref "docs/advanced/advanced-usage" %}})** - Advanced configuration options
|
||||
- **[Installer Options]({{% relref "docs/advanced/installer" %}})** - Installer configuration and options
|
||||
|
||||
## Performance & Optimization
|
||||
|
||||
- **[Performance Tuning]({{% relref "docs/advanced/performance-tuning" %}})** - Optimize for maximum performance
|
||||
- **[VRAM Management]({{% relref "docs/advanced/vram-management" %}})** - Manage GPU memory efficiently
|
||||
|
||||
## Specialized Topics
|
||||
|
||||
- **[Fine-tuning]({{% relref "docs/advanced/fine-tuning" %}})** - Fine-tune models for LocalAI
|
||||
|
||||
## Before You Begin
|
||||
|
||||
Make sure you have:
|
||||
- LocalAI installed and running
|
||||
- Basic understanding of YAML configuration
|
||||
- Familiarity with your system's resources
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [Getting Started]({{% relref "docs/getting-started" %}}) - Installation and basics
|
||||
- [Model Configuration]({{% relref "docs/advanced/model-configuration" %}}) - Configuration reference
|
||||
- [Troubleshooting]({{% relref "docs/troubleshooting" %}}) - Common issues
|
||||
- [Performance Tuning]({{% relref "docs/advanced/performance-tuning" %}}) - Optimization guide
|
||||
docs/content/docs/advanced/performance-tuning.md (new file, 344 lines)
@@ -0,0 +1,344 @@
|
||||
+++
|
||||
disableToc = false
|
||||
title = "Performance Tuning"
|
||||
weight = 22
|
||||
icon = "speed"
|
||||
description = "Optimize LocalAI for maximum performance"
|
||||
+++
|
||||
|
||||
This guide covers techniques to optimize LocalAI performance for your specific hardware and use case.
|
||||
|
||||
## Performance Metrics
|
||||
|
||||
Before optimizing, establish baseline metrics:
|
||||
|
||||
- **Tokens per second**: Measure inference speed
|
||||
- **Memory usage**: Monitor RAM and VRAM
|
||||
- **Latency**: Time to first token and total response time
|
||||
- **Throughput**: Requests per second
|
||||
|
||||
Enable debug mode to see performance stats:
|
||||
|
||||
```bash
|
||||
DEBUG=true local-ai
|
||||
```
|
||||
|
||||
Look for output like:
|
||||
```
|
||||
llm_load_tensors: tok/s: 45.23
|
||||
```
|
||||
|
||||
## CPU Optimization
|
||||
|
||||
### Thread Configuration
|
||||
|
||||
Match threads to CPU cores:
|
||||
|
||||
```yaml
|
||||
# Model configuration
|
||||
threads: 4 # For 4-core CPU
|
||||
```
|
||||
|
||||
**Guidelines**:
|
||||
- Use the number of physical cores, not hyperthreads (see the sketch after this list)
|
||||
- Leave 1-2 cores for system
|
||||
- Too many threads can hurt performance
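
To turn these guidelines into a starting value, here is a minimal Python sketch; it assumes the third-party `psutil` package is installed, and the result is only a starting point to benchmark against:

```python
# A minimal sketch for picking a starting thread count, assuming psutil is
# installed (pip install psutil). Benchmark around this value to confirm it.
import psutil

physical = psutil.cpu_count(logical=False) or 1  # physical cores, not hyperthreads
threads = max(1, physical - 1)                   # leave a core for the system

print(f"physical cores: {physical}, suggested threads: {threads}")
```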
|
||||
|
||||
### CPU Instructions
|
||||
|
||||
Enable appropriate CPU instructions:
|
||||
|
||||
```bash
|
||||
# Check available instructions
|
||||
cat /proc/cpuinfo | grep flags
|
||||
|
||||
# Build with optimizations
|
||||
CMAKE_ARGS="-DGGML_AVX2=ON -DGGML_AVX512=ON" make build
|
||||
```
|
||||
|
||||
### NUMA Optimization
|
||||
|
||||
For multi-socket systems:
|
||||
|
||||
```yaml
|
||||
numa: true
|
||||
```
|
||||
|
||||
### Memory Mapping
|
||||
|
||||
Enable memory mapping for faster model loading:
|
||||
|
||||
```yaml
|
||||
mmap: true
|
||||
mmlock: false # Set to true to lock in memory (faster but uses more RAM)
|
||||
```
|
||||
|
||||
## GPU Optimization
|
||||
|
||||
### Layer Offloading
|
||||
|
||||
Offload as many layers as GPU memory allows:
|
||||
|
||||
```yaml
|
||||
gpu_layers: 35 # Adjust based on GPU memory
|
||||
f16: true # Use FP16 for better performance
|
||||
```
|
||||
|
||||
**Finding optimal layers**:
|
||||
1. Start with 20 layers
|
||||
2. Monitor GPU memory with `nvidia-smi` or `rocm-smi` (see the sketch after this list)
|
||||
3. Gradually increase until near memory limit
|
||||
4. For maximum performance, offload all layers if possible
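
To make step 2 concrete, here is a minimal Python sketch that reports per-GPU memory headroom; it assumes an NVIDIA GPU with `nvidia-smi` on the PATH (use `rocm-smi` on AMD):

```python
# A minimal sketch for checking VRAM headroom while tuning gpu_layers.
# Assumes an NVIDIA GPU and nvidia-smi available on the PATH.
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout

for i, line in enumerate(out.strip().splitlines()):
    used, total = (int(x) for x in line.split(","))
    print(f"GPU {i}: {used} MiB used / {total} MiB total ({total - used} MiB free)")
```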
|
||||
|
||||
### Batch Processing
|
||||
|
||||
GPU excels at batch processing. Process multiple requests together when possible.
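
As a client-side illustration, here is a minimal Python sketch that issues several requests concurrently; the endpoint and the `gpt-4` model name follow the examples elsewhere in this guide and should be adjusted to your setup:

```python
# A minimal sketch for sending several chat requests concurrently so the
# server can process them together. Assumes LocalAI on localhost:8080 and a
# model named "gpt-4".
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:8080/v1/chat/completions"
PROMPTS = ["Summarize the news", "Write a haiku", "Explain DNS"]

def ask(prompt: str) -> str:
    resp = requests.post(
        URL,
        json={"model": "gpt-4", "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

with ThreadPoolExecutor(max_workers=4) as pool:
    for prompt, answer in zip(PROMPTS, pool.map(ask, PROMPTS)):
        print(f"{prompt!r} -> {answer[:60]}...")
```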
|
||||
|
||||
### Mixed Precision
|
||||
|
||||
Use FP16 when supported:
|
||||
|
||||
```yaml
|
||||
f16: true
|
||||
```
|
||||
|
||||
## Model Optimization
|
||||
|
||||
### Quantization
|
||||
|
||||
Choose appropriate quantization:
|
||||
|
||||
| Quantization | Speed | Quality | Memory | Use Case |
|
||||
|-------------|-------|---------|--------|----------|
|
||||
| Q8_0 | Slowest | Highest | Most | Maximum quality |
|
||||
| Q6_K | Slow | Very High | High | High quality |
|
||||
| Q4_K_M | Medium | High | Medium | **Recommended** |
|
||||
| Q4_K_S | Fast | Medium | Low | Balanced |
|
||||
| Q2_K | Fastest | Lower | Least | Speed priority |
|
||||
|
||||
### Context Size
|
||||
|
||||
Reduce context size for faster inference:
|
||||
|
||||
```yaml
|
||||
context_size: 2048 # Instead of 4096 or 8192
|
||||
```
|
||||
|
||||
**Trade-off**: A smaller context is faster but retains less conversation history
|
||||
|
||||
### Model Selection
|
||||
|
||||
Choose models appropriate for your hardware:
|
||||
|
||||
- **Small systems (4GB RAM)**: 1-3B parameter models
|
||||
- **Medium systems (8-16GB RAM)**: 3-7B parameter models
|
||||
- **Large systems (32GB+ RAM)**: 7B+ parameter models
|
||||
|
||||
## Configuration Optimizations
|
||||
|
||||
### Sampling Parameters
|
||||
|
||||
Optimize sampling for speed:
|
||||
|
||||
```yaml
|
||||
parameters:
|
||||
temperature: 0.7
|
||||
top_p: 0.9
|
||||
top_k: 40
|
||||
mirostat: 0 # Disable for speed (enabled by default)
|
||||
```
|
||||
|
||||
**Note**: Disabling mirostat improves speed but may reduce quality.
|
||||
|
||||
### Prompt Caching
|
||||
|
||||
Enable prompt caching for repeated queries:
|
||||
|
||||
```yaml
|
||||
prompt_cache_path: "cache"
|
||||
prompt_cache_all: true
|
||||
```
|
||||
|
||||
### Parallel Requests
|
||||
|
||||
LocalAI supports parallel requests. Configure appropriately:
|
||||
|
||||
```yaml
|
||||
# In model config
|
||||
parallel_requests: 4 # Adjust based on hardware
|
||||
```
|
||||
|
||||
## Storage Optimization
|
||||
|
||||
### Use SSD
|
||||
|
||||
Always use SSD for model storage:
|
||||
- HDD: Very slow model loading
|
||||
- SSD: Fast loading, better performance
|
||||
|
||||
### Disable MMAP on HDD
|
||||
|
||||
If stuck with HDD:
|
||||
|
||||
```yaml
|
||||
mmap: false # Loads entire model into RAM
|
||||
```
|
||||
|
||||
### Model Location
|
||||
|
||||
Store models on fastest storage:
|
||||
- Local SSD: Best performance
|
||||
- Network storage: Slower, but allows sharing
|
||||
- External drive: Slowest
|
||||
|
||||
## System-Level Optimizations
|
||||
|
||||
### Process Priority
|
||||
|
||||
Increase process priority (Linux):
|
||||
|
||||
```bash
|
||||
nice -n -10 local-ai
|
||||
```
|
||||
|
||||
### CPU Governor
|
||||
|
||||
Set CPU to performance mode (Linux):
|
||||
|
||||
```bash
|
||||
# Check current governor
|
||||
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
|
||||
|
||||
# Set to performance
|
||||
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
|
||||
```
|
||||
|
||||
### Disable Swapping
|
||||
|
||||
Prevent swapping for better performance:
|
||||
|
||||
```bash
|
||||
# Linux
|
||||
sudo swapoff -a
|
||||
|
||||
# Or set swappiness to 0
|
||||
echo 0 | sudo tee /proc/sys/vm/swappiness
|
||||
```
|
||||
|
||||
### Memory Allocation
|
||||
|
||||
For large models, consider huge pages (Linux):
|
||||
|
||||
```bash
|
||||
# Allocate huge pages
|
||||
echo 1024 | sudo tee /proc/sys/vm/nr_hugepages
|
||||
```
|
||||
|
||||
## Benchmarking
|
||||
|
||||
### Measure Performance
|
||||
|
||||
Create a benchmark script:
|
||||
|
||||
```python
|
||||
import time
|
||||
import requests
|
||||
|
||||
start = time.time()
|
||||
response = requests.post(
|
||||
"http://localhost:8080/v1/chat/completions",
|
||||
json={
|
||||
"model": "gpt-4",
|
||||
"messages": [{"role": "user", "content": "Hello"}]
|
||||
}
|
||||
)
|
||||
elapsed = time.time() - start
|
||||
|
||||
tokens = response.json()["usage"]["completion_tokens"]
|
||||
tokens_per_second = tokens / elapsed
|
||||
|
||||
print(f"Time: {elapsed:.2f}s")
|
||||
print(f"Tokens: {tokens}")
|
||||
print(f"Speed: {tokens_per_second:.2f} tok/s")
|
||||
```
|
||||
|
||||
### Compare Configurations
|
||||
|
||||
Test different configurations:
|
||||
1. Baseline: Default settings
|
||||
2. Optimized: Your optimizations
|
||||
3. Measure: tokens/second, latency, and memory (a sketch for repeated runs follows this list)
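
A minimal sketch for the measurement step, averaging tokens/second over several identical requests (restart LocalAI with each configuration between runs; the endpoint and model name are assumptions to adjust):

```python
# A minimal sketch for comparing configurations: send the same prompt several
# times and report the average tokens/second. Assumes LocalAI on
# localhost:8080 and a model named "gpt-4".
import time
import requests

def bench(runs: int = 5) -> float:
    speeds = []
    for _ in range(runs):
        start = time.time()
        resp = requests.post(
            "http://localhost:8080/v1/chat/completions",
            json={"model": "gpt-4",
                  "messages": [{"role": "user",
                                "content": "Write one paragraph about otters."}]},
            timeout=300,
        )
        resp.raise_for_status()
        tokens = resp.json()["usage"]["completion_tokens"]
        speeds.append(tokens / (time.time() - start))
    return sum(speeds) / len(speeds)

print(f"average speed: {bench():.2f} tok/s")
```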
|
||||
|
||||
### Load Testing
|
||||
|
||||
Test under load:
|
||||
|
||||
```bash
|
||||
# Use Apache Bench or similar
|
||||
ab -n 100 -c 10 -p request.json -T application/json \
|
||||
http://localhost:8080/v1/chat/completions
|
||||
```
|
||||
|
||||
## Platform-Specific Tips
|
||||
|
||||
### Apple Silicon
|
||||
|
||||
- Metal acceleration is automatic
|
||||
- Use native builds (not Docker) for best performance
|
||||
- M1/M2/M3 have unified memory - optimize accordingly
|
||||
|
||||
### NVIDIA GPUs
|
||||
|
||||
- Use CUDA 12 for latest optimizations
|
||||
- Enable Tensor Cores with appropriate precision
|
||||
- Monitor with `nvidia-smi` for bottlenecks
|
||||
|
||||
### AMD GPUs
|
||||
|
||||
- Use ROCm/HIPBLAS backend
|
||||
- Check ROCm compatibility
|
||||
- Monitor with `rocm-smi`
|
||||
|
||||
### Intel GPUs
|
||||
|
||||
- Use oneAPI/SYCL backend
|
||||
- Check Intel GPU compatibility
|
||||
- Optimize for F16/F32 precision
|
||||
|
||||
## Common Performance Issues
|
||||
|
||||
### Slow First Response
|
||||
|
||||
**Cause**: Model loading
|
||||
**Solution**: Pre-load models or use model warming
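
One simple way to warm a model is to send a tiny request at startup; a minimal sketch, assuming LocalAI on `localhost:8080` and adjusting the model list to your installation:

```python
# A minimal warm-up sketch: one tiny request per model at startup so the first
# real request does not pay the model-loading cost. Model names are examples.
import requests

for model in ["gpt-4"]:
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={"model": model,
              "messages": [{"role": "user", "content": "ping"}],
              "max_tokens": 1},
        timeout=600,  # the first load can take a while
    )
    resp.raise_for_status()
    print(f"warmed {model}")
```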
|
||||
|
||||
### Degrading Performance
|
||||
|
||||
**Cause**: Memory fragmentation
|
||||
**Solution**: Restart LocalAI periodically
|
||||
|
||||
### Inconsistent Speed
|
||||
|
||||
**Cause**: System load, thermal throttling
|
||||
**Solution**: Monitor system resources, ensure cooling
|
||||
|
||||
## Performance Checklist
|
||||
|
||||
- [ ] Threads match CPU cores
|
||||
- [ ] GPU layers optimized
|
||||
- [ ] Appropriate quantization selected
|
||||
- [ ] Context size optimized
|
||||
- [ ] Models on SSD
|
||||
- [ ] MMAP enabled (if using SSD)
|
||||
- [ ] Mirostat disabled (if speed priority)
|
||||
- [ ] System resources monitored
|
||||
- [ ] Baseline metrics established
|
||||
- [ ] Optimizations tested and verified
|
||||
|
||||
## See Also
|
||||
|
||||
- [GPU Acceleration]({{% relref "docs/features/gpu-acceleration" %}}) - GPU setup
|
||||
- [VRAM Management]({{% relref "docs/advanced/vram-management" %}}) - GPU memory
|
||||
- [Model Configuration]({{% relref "docs/advanced/model-configuration" %}}) - Configuration options
|
||||
- [Troubleshooting]({{% relref "docs/troubleshooting" %}}) - Performance issues
|
||||
|
||||
@@ -14,7 +14,31 @@ Here are answers to some of the most common questions.
|
||||
|
||||
### How do I get models?
|
||||
|
||||
Most GGUF-based models should work, but newer models may require additions to the API. If a model doesn't work, please feel free to open an issue. However, be cautious about downloading models from the internet directly onto your machine, as there may be security vulnerabilities in llama.cpp or ggml that could be maliciously exploited. Some models can be found on Hugging Face: https://huggingface.co/models?search=gguf; models from gpt4all are compatible too: https://github.com/nomic-ai/gpt4all.
|
||||
There are several ways to get models for LocalAI:
|
||||
|
||||
1. **WebUI Import** (Easiest): Use the WebUI's model import interface:
|
||||
- Open `http://localhost:8080` and navigate to the Models tab
|
||||
- Click "Import Model" or "New Model"
|
||||
- Enter a model URI (Hugging Face, OCI, file path, etc.)
|
||||
- Configure preferences in Simple Mode or edit YAML in Advanced Mode
|
||||
- The WebUI provides syntax highlighting, validation, and a user-friendly interface
|
||||
|
||||
2. **Model Gallery** (Recommended): Use the built-in model gallery accessible via:
|
||||
- WebUI: Navigate to the Models tab in the LocalAI interface and browse available models
|
||||
- CLI: `local-ai models list` to see available models, then `local-ai models install <model-name>`
|
||||
- Online: Browse models at [models.localai.io](https://models.localai.io)
|
||||
|
||||
3. **Hugging Face**: Most GGUF-based models from Hugging Face work with LocalAI. You can install them via:
|
||||
- WebUI: Import using `huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf`
|
||||
- CLI: `local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf`
|
||||
|
||||
4. **Manual Installation**: Download model files and place them in your models directory. See [Install and Run Models]({{% relref "docs/getting-started/models" %}}) for details.
|
||||
|
||||
5. **OCI Registries**: Install models from OCI-compatible registries:
|
||||
- WebUI: Import using `ollama://gemma:2b` or `oci://localai/phi-2:latest`
|
||||
- CLI: `local-ai run ollama://gemma:2b` or `local-ai run oci://localai/phi-2:latest`
|
||||
|
||||
**Security Note**: Be cautious when downloading models from the internet. Always verify the source and use trusted repositories when possible.
|
||||
|
||||
### Where are models stored?
|
||||
|
||||
@@ -70,7 +94,15 @@ There is GPU support, see {{%relref "docs/features/GPU-acceleration" %}}.
|
||||
|
||||
### Where is the webUI?
|
||||
|
||||
localai-webui and chatbot-ui are available in the examples section and can be set up per the instructions. However, since LocalAI is an API, you can already plug it into existing projects that provide UI interfaces to OpenAI's APIs. There are several on GitHub already, and they should be compatible with LocalAI (as it mimics the OpenAI API).
|
||||
LocalAI includes a built-in WebUI that is automatically available when you start LocalAI. Simply navigate to `http://localhost:8080` in your web browser after starting LocalAI.
|
||||
|
||||
The WebUI provides:
|
||||
- Chat interface for interacting with models
|
||||
- Model gallery browser and installer
|
||||
- Backend management
|
||||
- Configuration tools
|
||||
|
||||
If you prefer a different interface, LocalAI is compatible with any OpenAI-compatible UI. You can find examples in the [LocalAI-examples repository](https://github.com/mudler/LocalAI-examples), including integrations with popular UIs like chatbot-ui.
|
||||
|
||||
### Does it work with AutoGPT?
|
||||
|
||||
@@ -88,3 +120,96 @@ This typically happens when your prompt exceeds the context size. Try to reduce
|
||||
### I'm getting a 'SIGILL' error, what's wrong?
|
||||
|
||||
Your CPU probably does not have support for certain instructions that are compiled by default in the pre-built binaries. If you are running in a container, try setting `REBUILD=true` and disable the CPU instructions that are not compatible with your CPU. For instance: `CMAKE_ARGS="-DGGML_F16C=OFF -DGGML_AVX512=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF" make build`
|
||||
|
||||
Alternatively, you can use the backend management system to install a compatible backend for your CPU architecture. See [Backend Management]({{% relref "docs/features/backends" %}}) for more information.
|
||||
|
||||
### How do I install backends?
|
||||
|
||||
LocalAI now uses a backend management system where backends are automatically downloaded when needed. You can also manually install backends:
|
||||
|
||||
```bash
|
||||
# List available backends
|
||||
local-ai backends list
|
||||
|
||||
# Install a specific backend
|
||||
local-ai backends install llama-cpp
|
||||
|
||||
# Install a backend for a specific GPU type
|
||||
local-ai backends install llama-cpp --gpu-type nvidia
|
||||
```
|
||||
|
||||
For more details, see the [Backends documentation]({{% relref "docs/features/backends" %}}).
|
||||
|
||||
### How do I set up API keys for security?
|
||||
|
||||
You can secure your LocalAI instance by setting API keys using the `API_KEY` environment variable:
|
||||
|
||||
```bash
|
||||
# Single API key
|
||||
API_KEY=your-secret-key local-ai
|
||||
|
||||
# Multiple API keys (comma-separated)
|
||||
API_KEY=key1,key2,key3 local-ai
|
||||
```
|
||||
|
||||
When API keys are set, all requests must include the key in the `Authorization` header:
|
||||
```bash
|
||||
curl http://localhost:8080/v1/models \
|
||||
-H "Authorization: Bearer your-secret-key"
|
||||
```
|
||||
|
||||
**Important**: API keys provide full access to all LocalAI features (admin-level access). Make sure to protect your API keys and use HTTPS when exposing LocalAI remotely.
|
||||
|
||||
### My model is not loading or showing errors
|
||||
|
||||
Here are common issues and solutions:
|
||||
|
||||
1. **Backend not installed**: The required backend may not be installed. Check with `local-ai backends list` and install if needed.
|
||||
2. **Insufficient memory**: Large models require significant RAM. Check available memory and consider using a smaller quantized model.
|
||||
3. **Wrong backend specified**: Ensure the backend in your model configuration matches the model type. See the [Compatibility Table]({{% relref "docs/reference/compatibility-table" %}}).
|
||||
4. **Model file corruption**: Re-download the model file.
|
||||
5. **Check logs**: Enable debug mode (`DEBUG=true`) to see detailed error messages.
|
||||
|
||||
For more troubleshooting help, see the [Troubleshooting Guide]({{% relref "docs/troubleshooting" %}}).
|
||||
|
||||
### How do I use GPU acceleration?
|
||||
|
||||
LocalAI supports multiple GPU types:
|
||||
|
||||
- **NVIDIA (CUDA)**: Use `--gpus all` with Docker and CUDA-enabled images
|
||||
- **AMD (ROCm)**: Use images with `hipblas` tag
|
||||
- **Intel**: Use images with `intel` tag or Intel oneAPI
|
||||
- **Apple Silicon (Metal)**: Automatically detected on macOS
|
||||
|
||||
For detailed setup instructions, see [GPU Acceleration]({{% relref "docs/features/gpu-acceleration" %}}).
|
||||
|
||||
### Can I use LocalAI with LangChain, AutoGPT, or other frameworks?
|
||||
|
||||
Yes! LocalAI is compatible with any framework that supports OpenAI's API. Simply point the framework to your LocalAI endpoint:
|
||||
|
||||
```python
|
||||
# Example with LangChain
|
||||
from langchain.llms import OpenAI
|
||||
|
||||
llm = OpenAI(
|
||||
openai_api_key="not-needed",
|
||||
openai_api_base="http://localhost:8080/v1"
|
||||
)
|
||||
```
|
||||
|
||||
See the [Integrations]({{% relref "docs/integrations" %}}) page for a list of compatible projects and examples.
|
||||
|
||||
### What's the difference between AIO images and standard images?
|
||||
|
||||
**AIO (All-in-One) images** come pre-configured with:
|
||||
- Pre-installed models ready to use
|
||||
- All necessary backends included
|
||||
- Quick start with no configuration needed
|
||||
|
||||
**Standard images** are:
|
||||
- Smaller in size
|
||||
- No pre-installed models
|
||||
- You install models and backends as needed
|
||||
- More flexible for custom setups
|
||||
|
||||
Choose AIO images for quick testing and standard images for production deployments. See [Container Images]({{% relref "docs/getting-started/container-images" %}}) for details.
|
||||
|
||||
@@ -1,8 +1,56 @@
|
||||
|
||||
+++
|
||||
disableToc = false
|
||||
title = "Features"
|
||||
weight = 8
|
||||
icon = "feature_search"
|
||||
url = "/features/"
|
||||
description = "Explore all LocalAI capabilities and features"
|
||||
+++
|
||||
|
||||
LocalAI provides a comprehensive set of AI capabilities, all running locally with OpenAI-compatible APIs.
|
||||
|
||||
## Core Features
|
||||
|
||||
### Text Generation
|
||||
|
||||
- **[Text Generation]({{% relref "docs/features/text-generation" %}})** - Generate text with various LLMs
|
||||
- **[OpenAI Functions]({{% relref "docs/features/openai-functions" %}})** - Function calling and tools API
|
||||
- **[Constrained Grammars]({{% relref "docs/features/constrained_grammars" %}})** - Structured output generation
|
||||
- **[Model Context Protocol (MCP)]({{% relref "docs/features/mcp" %}})** - Agentic capabilities
|
||||
|
||||
### Multimodal
|
||||
|
||||
- **[GPT Vision]({{% relref "docs/features/gpt-vision" %}})** - Image understanding and analysis
|
||||
- **[Image Generation]({{% relref "docs/features/image-generation" %}})** - Create images from text
|
||||
- **[Object Detection]({{% relref "docs/features/object-detection" %}})** - Detect objects in images
|
||||
|
||||
### Audio
|
||||
|
||||
- **[Text to Audio]({{% relref "docs/features/text-to-audio" %}})** - Generate speech from text
|
||||
- **[Audio to Text]({{% relref "docs/features/audio-to-text" %}})** - Transcribe audio to text
|
||||
|
||||
### Data & Search
|
||||
|
||||
- **[Embeddings]({{% relref "docs/features/embeddings" %}})** - Generate vector embeddings
|
||||
- **[Reranker]({{% relref "docs/features/reranker" %}})** - Document relevance scoring
|
||||
- **[Stores]({{% relref "docs/features/stores" %}})** - Vector database storage
|
||||
|
||||
## Infrastructure
|
||||
|
||||
- **[Backends]({{% relref "docs/features/backends" %}})** - Backend management and installation
|
||||
- **[GPU Acceleration]({{% relref "docs/features/gpu-acceleration" %}})** - GPU support and optimization
|
||||
- **[Model Gallery]({{% relref "docs/features/model-gallery" %}})** - Browse and install models
|
||||
- **[Distributed Inferencing]({{% relref "docs/features/distributed_inferencing" %}})** - P2P and distributed inference
|
||||
|
||||
## Getting Started with Features
|
||||
|
||||
1. **Install LocalAI**: See [Getting Started]({{% relref "docs/getting-started" %}})
|
||||
2. **Install Models**: See [Setting Up Models]({{% relref "docs/tutorials/setting-up-models" %}})
|
||||
3. **Try Features**: See [Try It Out]({{% relref "docs/getting-started/try-it-out" %}})
|
||||
4. **Configure**: See [Advanced Configuration]({{% relref "docs/advanced" %}})
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [API Reference]({{% relref "docs/reference/api-reference" %}}) - Complete API documentation
|
||||
- [Compatibility Table]({{% relref "docs/reference/compatibility-table" %}}) - Supported models and backends
|
||||
- [Tutorials]({{% relref "docs/tutorials" %}}) - Step-by-step guides
|
||||
|
||||
@@ -1,7 +1,49 @@
|
||||
|
||||
+++
|
||||
disableToc = false
|
||||
title = "Getting started"
|
||||
title = "Getting Started"
|
||||
weight = 2
|
||||
icon = "rocket_launch"
|
||||
description = "Install LocalAI and run your first AI model"
|
||||
+++
|
||||
|
||||
Welcome to LocalAI! This section will guide you through installation and your first steps.
|
||||
|
||||
## Quick Start
|
||||
|
||||
**New to LocalAI?** Start here:
|
||||
|
||||
1. **[Quickstart]({{% relref "docs/getting-started/quickstart" %}})** - Get LocalAI running in minutes
|
||||
2. **[Your First Chat]({{% relref "docs/tutorials/first-chat" %}})** - Complete beginner tutorial
|
||||
3. **[Try It Out]({{% relref "docs/getting-started/try-it-out" %}})** - Test the API with examples
|
||||
|
||||
## Installation Options
|
||||
|
||||
Choose the installation method that works for you:
|
||||
|
||||
- **[Quickstart]({{% relref "docs/getting-started/quickstart" %}})** - Docker, installer, or binaries
|
||||
- **[Container Images]({{% relref "docs/getting-started/container-images" %}})** - Docker deployment options
|
||||
- **[Build from Source]({{% relref "docs/getting-started/build" %}})** - Compile LocalAI yourself
|
||||
- **[Kubernetes]({{% relref "docs/getting-started/kubernetes" %}})** - Deploy on Kubernetes
|
||||
|
||||
## Setting Up Models
|
||||
|
||||
Once LocalAI is installed:
|
||||
|
||||
- **[Install and Run Models]({{% relref "docs/getting-started/models" %}})** - Model installation guide
|
||||
- **[Setting Up Models Tutorial]({{% relref "docs/tutorials/setting-up-models" %}})** - Step-by-step model setup
|
||||
- **[Customize Models]({{% relref "docs/getting-started/customize-model" %}})** - Configure model behavior
|
||||
|
||||
## What's Next?
|
||||
|
||||
After installation:
|
||||
|
||||
- Explore [Features]({{% relref "docs/features" %}}) - See what LocalAI can do
|
||||
- Follow [Tutorials]({{% relref "docs/tutorials" %}}) - Learn step-by-step
|
||||
- Check [FAQ]({{% relref "docs/faq" %}}) - Common questions
|
||||
- Read [Documentation]({{% relref "docs" %}}) - Complete reference
|
||||
|
||||
## Need Help?
|
||||
|
||||
- [FAQ]({{% relref "docs/faq" %}}) - Common questions and answers
|
||||
- [Troubleshooting]({{% relref "docs/troubleshooting" %}}) - Solutions to problems
|
||||
- [Discord](https://discord.gg/uJAeKSAGDy) - Community support
|
||||
|
||||
@@ -7,6 +7,7 @@ icon = "rocket_launch"
|
||||
|
||||
To install models with LocalAI, you can:
|
||||
|
||||
- **Import via WebUI** (Recommended for beginners): Use the WebUI's model import interface to import models from URIs with a user-friendly interface. Supports both simple mode (with preferences) and advanced mode (YAML editor). See the [Setting Up Models tutorial]({{% relref "docs/tutorials/setting-up-models" %}}) for details.
|
||||
- Browse the Model Gallery from the Web Interface and install models with a couple of clicks. For more details, refer to the [Gallery Documentation]({{% relref "docs/features/model-gallery" %}}).
|
||||
- Specify a model from the LocalAI gallery during startup, e.g., `local-ai run <model_gallery_name>`.
|
||||
- Use a URI to specify a model file (e.g., `huggingface://...`, `oci://`, or `ollama://`) when starting LocalAI, e.g., `local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf`.
|
||||
@@ -31,9 +32,29 @@ local-ai models install hermes-2-theta-llama-3-8b
|
||||
|
||||
Note: The galleries available in LocalAI can be customized to point to a different URL or a local directory. For more information on how to setup your own gallery, see the [Gallery Documentation]({{% relref "docs/features/model-gallery" %}}).
|
||||
|
||||
## Run Models via URI
|
||||
## Import Models via WebUI
|
||||
|
||||
To run models via URI, specify a URI to a model file or a configuration file when starting LocalAI. Valid syntax includes:
|
||||
The easiest way to import models is through the WebUI's import interface:
|
||||
|
||||
1. Open the LocalAI WebUI at `http://localhost:8080`
|
||||
2. Navigate to the "Models" tab
|
||||
3. Click "Import Model" or "New Model"
|
||||
4. Choose your import method:
|
||||
- **Simple Mode**: Enter a model URI and configure preferences (backend, name, description, quantizations, etc.)
|
||||
- **Advanced Mode**: Edit YAML configuration directly with syntax highlighting and validation
|
||||
|
||||
The WebUI import supports all URI types:
|
||||
- `huggingface://repository_id/model_file`
|
||||
- `oci://container_image:tag`
|
||||
- `ollama://model_id:tag`
|
||||
- `file://path/to/model`
|
||||
- `https://...` (for configuration files)
|
||||
|
||||
For detailed instructions, see the [Setting Up Models tutorial]({{% relref "docs/tutorials/setting-up-models" %}}).
|
||||
|
||||
## Run Models via URI (CLI)
|
||||
|
||||
To run models via URI from the command line, specify a URI to a model file or a configuration file when starting LocalAI. Valid syntax includes:
|
||||
|
||||
- `file://path/to/model`
|
||||
- `huggingface://repository_id/model_file` (e.g., `huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf`)
|
||||
@@ -172,7 +193,7 @@ curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d
|
||||
{{% alert icon="💡" %}}
|
||||
**Other Docker Images**:
|
||||
|
||||
For other Docker images, please refer to the table in [Getting Started](https://localai.io/basics/getting_started/#container-images).
|
||||
For other Docker images, please refer to the table in [Container Images]({{% relref "docs/getting-started/container-images" %}}).
|
||||
{{% /alert %}}
|
||||
|
||||
Note: If you are on Windows, ensure the project is on the Linux filesystem to avoid slow model loading. For more information, see the [Microsoft Docs](https://learn.microsoft.com/en-us/windows/wsl/filesystems).
|
||||
|
||||
@@ -70,7 +70,7 @@ You can use Docker for a quick start:
|
||||
docker run -p 8080:8080 --name local-ai -ti localai/localai:latest-aio-cpu
|
||||
```
|
||||
|
||||
For more detailed installation options and configurations, see our [Getting Started guide](/basics/getting_started/).
|
||||
For more detailed installation options and configurations, see our [Getting Started guide]({{% relref "docs/getting-started/quickstart" %}}).
|
||||
|
||||
## One-liner
|
||||
|
||||
@@ -104,9 +104,9 @@ LocalAI is a community-driven project. You can:
|
||||
|
||||
Ready to dive in? Here are some recommended next steps:
|
||||
|
||||
1. [Install LocalAI](/basics/getting_started/)
|
||||
1. [Install LocalAI]({{% relref "docs/getting-started/quickstart" %}})
|
||||
2. [Explore available models](https://models.localai.io)
|
||||
3. [Model compatibility](/model-compatibility/)
|
||||
3. [Model compatibility]({{% relref "docs/reference/compatibility-table" %}})
|
||||
4. [Try out examples](https://github.com/mudler/LocalAI-examples)
|
||||
5. [Join the community](https://discord.gg/uJAeKSAGDy)
|
||||
6. [Check the LocalAI Github repository](https://github.com/mudler/LocalAI)
|
||||
|
||||
docs/content/docs/reference/api-reference.md (new file, 445 lines)
@@ -0,0 +1,445 @@
|
||||
+++
|
||||
disableToc = false
|
||||
title = "API Reference"
|
||||
weight = 22
|
||||
icon = "api"
|
||||
description = "Complete API reference for LocalAI's OpenAI-compatible endpoints"
|
||||
+++
|
||||
|
||||
LocalAI provides a REST API that is compatible with OpenAI's API specification. This document provides a complete reference for all available endpoints.
|
||||
|
||||
## Base URL
|
||||
|
||||
All API requests should be made to:
|
||||
|
||||
```
|
||||
http://localhost:8080/v1
|
||||
```
|
||||
|
||||
For production deployments, replace `localhost:8080` with your server's address.
|
||||
|
||||
## Authentication
|
||||
|
||||
If API keys are configured (via `API_KEY` environment variable), include the key in the `Authorization` header:
|
||||
|
||||
```bash
|
||||
Authorization: Bearer your-api-key
|
||||
```
|
||||
|
||||
## Endpoints
|
||||
|
||||
### Chat Completions
|
||||
|
||||
Create a model response for the given chat conversation.
|
||||
|
||||
**Endpoint**: `POST /v1/chat/completions`
|
||||
|
||||
**Request Body**:
|
||||
|
||||
```json
|
||||
{
|
||||
"model": "gpt-4",
|
||||
"messages": [
|
||||
{"role": "system", "content": "You are a helpful assistant."},
|
||||
{"role": "user", "content": "Hello!"}
|
||||
],
|
||||
"temperature": 0.7,
|
||||
"max_tokens": 100,
|
||||
"top_p": 1.0,
|
||||
"top_k": 40,
|
||||
"stream": false
|
||||
}
|
||||
```
|
||||
|
||||
**Parameters**:
|
||||
|
||||
| Parameter | Type | Description | Default |
|
||||
|-----------|------|-------------|---------|
|
||||
| `model` | string | The model to use | Required |
|
||||
| `messages` | array | Array of message objects | Required |
|
||||
| `temperature` | number | Sampling temperature (0-2) | 0.7 |
|
||||
| `max_tokens` | integer | Maximum tokens to generate | Model default |
|
||||
| `top_p` | number | Nucleus sampling parameter | 1.0 |
|
||||
| `top_k` | integer | Top-k sampling parameter | 40 |
|
||||
| `stream` | boolean | Stream responses | false |
|
||||
| `tools` | array | Available tools/functions | - |
|
||||
| `tool_choice` | string | Tool selection mode | "auto" |
|
||||
|
||||
**Response**:
|
||||
|
||||
```json
|
||||
{
|
||||
"id": "chatcmpl-123",
|
||||
"object": "chat.completion",
|
||||
"created": 1677652288,
|
||||
"choices": [{
|
||||
"index": 0,
|
||||
"message": {
|
||||
"role": "assistant",
|
||||
"content": "Hello! How can I help you today?"
|
||||
},
|
||||
"finish_reason": "stop"
|
||||
}],
|
||||
"usage": {
|
||||
"prompt_tokens": 9,
|
||||
"completion_tokens": 12,
|
||||
"total_tokens": 21
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Example**:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8080/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "gpt-4",
|
||||
"messages": [{"role": "user", "content": "Hello!"}]
|
||||
}'
|
||||
```
|
||||
|
||||
### Completions
|
||||
|
||||
Create a completion for the provided prompt.
|
||||
|
||||
**Endpoint**: `POST /v1/completions`
|
||||
|
||||
**Request Body**:
|
||||
|
||||
```json
|
||||
{
|
||||
"model": "gpt-4",
|
||||
"prompt": "The capital of France is",
|
||||
"temperature": 0.7,
|
||||
"max_tokens": 10
|
||||
}
|
||||
```
|
||||
|
||||
**Parameters**:
|
||||
|
||||
| Parameter | Type | Description |
|
||||
|-----------|------|-------------|
|
||||
| `model` | string | The model to use |
|
||||
| `prompt` | string | The prompt to complete |
|
||||
| `temperature` | number | Sampling temperature |
|
||||
| `max_tokens` | integer | Maximum tokens to generate |
|
||||
| `top_p` | number | Nucleus sampling |
|
||||
| `top_k` | integer | Top-k sampling |
|
||||
| `stream` | boolean | Stream responses |
|
||||
|
||||
**Example**:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8080/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "gpt-4",
|
||||
"prompt": "The capital of France is",
|
||||
"max_tokens": 10
|
||||
}'
|
||||
```
|
||||
|
||||
### Edits
|
||||
|
||||
Create an edited version of the input.
|
||||
|
||||
**Endpoint**: `POST /v1/edits`
|
||||
|
||||
**Request Body**:
|
||||
|
||||
```json
|
||||
{
|
||||
"model": "gpt-4",
|
||||
"instruction": "Make it more formal",
|
||||
"input": "Hey, how are you?",
|
||||
"temperature": 0.7
|
||||
}
|
||||
```
|
||||
|
||||
**Example**:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8080/v1/edits \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "gpt-4",
|
||||
"instruction": "Make it more formal",
|
||||
"input": "Hey, how are you?"
|
||||
}'
|
||||
```
|
||||
|
||||
### Embeddings
|
||||
|
||||
Get a vector representation of input text.
|
||||
|
||||
**Endpoint**: `POST /v1/embeddings`
|
||||
|
||||
**Request Body**:
|
||||
|
||||
```json
|
||||
{
|
||||
"model": "text-embedding-ada-002",
|
||||
"input": "The food was delicious"
|
||||
}
|
||||
```
|
||||
|
||||
**Response**:
|
||||
|
||||
```json
|
||||
{
|
||||
"object": "list",
|
||||
"data": [{
|
||||
"object": "embedding",
|
||||
"embedding": [0.1, 0.2, 0.3, ...],
|
||||
"index": 0
|
||||
}],
|
||||
"usage": {
|
||||
"prompt_tokens": 4,
|
||||
"total_tokens": 4
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Example**:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8080/v1/embeddings \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "text-embedding-ada-002",
|
||||
"input": "The food was delicious"
|
||||
}'
|
||||
```
|
||||
|
||||
### Audio Transcription
|
||||
|
||||
Transcribe audio into the input language.
|
||||
|
||||
**Endpoint**: `POST /v1/audio/transcriptions`
|
||||
|
||||
**Request**: `multipart/form-data`
|
||||
|
||||
**Form Fields**:
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `file` | file | Audio file to transcribe |
|
||||
| `model` | string | Model to use (e.g., "whisper-1") |
|
||||
| `language` | string | Language code (optional) |
|
||||
| `prompt` | string | Optional text prompt |
|
||||
| `response_format` | string | Response format (json, text, etc.) |
|
||||
|
||||
**Example**:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8080/v1/audio/transcriptions \
|
||||
-H "Authorization: Bearer not-needed" \
|
||||
-F file="@audio.mp3" \
|
||||
-F model="whisper-1"
|
||||
```
|
||||
|
||||
### Audio Speech (Text-to-Speech)
|
||||
|
||||
Generate audio from text.
|
||||
|
||||
**Endpoint**: `POST /v1/audio/speech`
|
||||
|
||||
**Request Body**:
|
||||
|
||||
```json
|
||||
{
|
||||
"model": "tts-1",
|
||||
"input": "Hello, this is a test",
|
||||
"voice": "alloy",
|
||||
"response_format": "mp3"
|
||||
}
|
||||
```
|
||||
|
||||
**Parameters**:
|
||||
|
||||
| Parameter | Type | Description |
|
||||
|-----------|------|-------------|
|
||||
| `model` | string | TTS model to use |
|
||||
| `input` | string | Text to convert to speech |
|
||||
| `voice` | string | Voice to use (alloy, echo, fable, etc.) |
|
||||
| `response_format` | string | Audio format (mp3, opus, etc.) |
|
||||
|
||||
**Example**:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8080/v1/audio/speech \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "tts-1",
|
||||
"input": "Hello, this is a test",
|
||||
"voice": "alloy"
|
||||
}' \
|
||||
--output speech.mp3
|
||||
```
|
||||
|
||||
### Image Generation
|
||||
|
||||
Generate images from text prompts.
|
||||
|
||||
**Endpoint**: `POST /v1/images/generations`
|
||||
|
||||
**Request Body**:
|
||||
|
||||
```json
|
||||
{
|
||||
"prompt": "A cute baby sea otter",
|
||||
"n": 1,
|
||||
"size": "256x256",
|
||||
"response_format": "url"
|
||||
}
|
||||
```
|
||||
|
||||
**Parameters**:
|
||||
|
||||
| Parameter | Type | Description |
|
||||
|-----------|------|-------------|
|
||||
| `prompt` | string | Text description of the image |
|
||||
| `n` | integer | Number of images to generate |
|
||||
| `size` | string | Image size (256x256, 512x512, etc.) |
|
||||
| `response_format` | string | Response format (url, b64_json) |
|
||||
|
||||
**Example**:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8080/v1/images/generations \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"prompt": "A cute baby sea otter",
|
||||
"size": "256x256"
|
||||
}'
|
||||
```
|
||||
|
||||
### List Models
|
||||
|
||||
List all available models.
|
||||
|
||||
**Endpoint**: `GET /v1/models`
|
||||
|
||||
**Query Parameters**:
|
||||
|
||||
| Parameter | Type | Description |
|
||||
|-----------|------|-------------|
|
||||
| `filter` | string | Filter models by name |
|
||||
| `excludeConfigured` | boolean | Exclude configured models |
|
||||
|
||||
**Response**:
|
||||
|
||||
```json
|
||||
{
|
||||
"object": "list",
|
||||
"data": [
|
||||
{
|
||||
"id": "gpt-4",
|
||||
"object": "model"
|
||||
},
|
||||
{
|
||||
"id": "gpt-4-vision-preview",
|
||||
"object": "model"
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**Example**:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8080/v1/models
|
||||
```
|
||||
|
||||
## Streaming Responses
|
||||
|
||||
Many endpoints support streaming. Set `"stream": true` in the request:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8080/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "gpt-4",
|
||||
"messages": [{"role": "user", "content": "Hello!"}],
|
||||
"stream": true
|
||||
}'
|
||||
```
|
||||
|
||||
Stream responses are sent as Server-Sent Events (SSE):
|
||||
|
||||
```
|
||||
data: {"id":"chatcmpl-123","object":"chat.completion.chunk",...}
|
||||
|
||||
data: {"id":"chatcmpl-123","object":"chat.completion.chunk",...}
|
||||
|
||||
data: [DONE]
|
||||
```
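
A minimal Python sketch for consuming the stream, assuming the chunks follow the OpenAI `delta` format shown above:

```python
# A minimal sketch for reading the SSE stream with requests. Assumes chunks
# follow the OpenAI chat.completion.chunk format with a "delta" field.
import json
import requests

payload = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,
}

with requests.post("http://localhost:8080/v1/chat/completions",
                   json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue  # skip keep-alive blank lines
        data = line.decode("utf-8").removeprefix("data: ")
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```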
|
||||
|
||||
## Error Handling
|
||||
|
||||
### Error Response Format
|
||||
|
||||
```json
|
||||
{
|
||||
"error": {
|
||||
"message": "Error description",
|
||||
"type": "invalid_request_error",
|
||||
"code": 400
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Common Error Codes
|
||||
|
||||
| Code | Description |
|
||||
|------|-------------|
|
||||
| 400 | Bad Request - Invalid parameters |
|
||||
| 401 | Unauthorized - Missing or invalid API key |
|
||||
| 404 | Not Found - Model or endpoint not found |
|
||||
| 429 | Too Many Requests - Rate limit exceeded |
|
||||
| 500 | Internal Server Error - Server error |
|
||||
| 503 | Service Unavailable - Model not loaded |
|
||||
|
||||
### Example Error Handling
|
||||
|
||||
```python
|
||||
import requests
|
||||
|
||||
try:
|
||||
response = requests.post(
|
||||
"http://localhost:8080/v1/chat/completions",
|
||||
json={"model": "gpt-4", "messages": [...]},
|
||||
timeout=30
|
||||
)
|
||||
response.raise_for_status()
|
||||
data = response.json()
|
||||
except requests.exceptions.HTTPError as e:
|
||||
if e.response.status_code == 404:
|
||||
print("Model not found")
|
||||
elif e.response.status_code == 503:
|
||||
print("Model not loaded")
|
||||
else:
|
||||
print(f"Error: {e}")
|
||||
```
|
||||
|
||||
## Rate Limiting
|
||||
|
||||
LocalAI doesn't enforce rate limiting by default. For production deployments, implement rate limiting at the reverse proxy or application level.
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Use appropriate timeouts**: Set reasonable timeouts for requests
|
||||
2. **Handle errors gracefully**: Implement retry logic with exponential backoff (see the sketch after this list)
|
||||
3. **Monitor token usage**: Track `usage` fields in responses
|
||||
4. **Use streaming for long responses**: Enable streaming for better user experience
|
||||
5. **Cache embeddings**: Cache embedding results when possible
|
||||
6. **Batch requests**: Process multiple items together when possible
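
For point 2, a minimal retry sketch with exponential backoff against transient errors (for example a 503 while a model is still loading); the endpoint and model name are assumptions:

```python
# A minimal retry sketch with exponential backoff for transient errors.
import time
import requests

def chat(payload: dict, retries: int = 5) -> dict:
    delay = 1.0
    for _ in range(retries):
        resp = requests.post("http://localhost:8080/v1/chat/completions",
                             json=payload, timeout=120)
        if resp.status_code not in (429, 500, 503):
            resp.raise_for_status()  # non-retryable errors surface immediately
            return resp.json()
        time.sleep(delay)  # wait before retrying a transient failure
        delay *= 2         # exponential backoff
    raise RuntimeError(f"still failing after {retries} attempts")

print(chat({"model": "gpt-4",
            "messages": [{"role": "user", "content": "Hello!"}]}))
```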
|
||||
|
||||
## See Also
|
||||
|
||||
- [OpenAI API Documentation](https://platform.openai.com/docs/api-reference) - Original OpenAI API reference
|
||||
- [Try It Out]({{% relref "docs/getting-started/try-it-out" %}}) - Interactive examples
|
||||
- [Integration Examples]({{% relref "docs/tutorials/integration-examples" %}}) - Framework integrations
|
||||
- [Troubleshooting]({{% relref "docs/troubleshooting" %}}) - API issues
|
||||
|
||||
docs/content/docs/security.md (new file, 318 lines)
@@ -0,0 +1,318 @@
|
||||
+++
|
||||
disableToc = false
|
||||
title = "Security Best Practices"
|
||||
weight = 26
|
||||
icon = "security"
|
||||
description = "Security guidelines for deploying LocalAI"
|
||||
+++
|
||||
|
||||
This guide covers security best practices for deploying LocalAI in various environments, from local development to production.
|
||||
|
||||
## Overview
|
||||
|
||||
LocalAI processes sensitive data and may be exposed to networks. Follow these practices to secure your deployment.
|
||||
|
||||
## API Key Protection
|
||||
|
||||
### Always Use API Keys in Production
|
||||
|
||||
**Never expose LocalAI without API keys**:
|
||||
|
||||
```bash
|
||||
# Set API key
|
||||
API_KEY=your-secure-random-key local-ai
|
||||
|
||||
# Multiple keys (comma-separated)
|
||||
API_KEY=key1,key2,key3 local-ai
|
||||
```
|
||||
|
||||
### API Key Best Practices
|
||||
|
||||
1. **Generate strong keys**: Use cryptographically secure random strings
|
||||
```bash
|
||||
# Generate a secure key
|
||||
openssl rand -hex 32
|
||||
```
|
||||
|
||||
2. **Store securely**:
|
||||
- Use environment variables
|
||||
- Use secrets management (Kubernetes Secrets, HashiCorp Vault, etc.)
|
||||
- Never commit keys to version control
|
||||
|
||||
3. **Rotate regularly**: Change API keys periodically
|
||||
|
||||
4. **Use different keys**: Different keys for different services/clients
|
||||
|
||||
5. **Limit key scope**: Consider implementing key-based rate limiting
|
||||
|
||||
### Using API Keys
|
||||
|
||||
Include the key in requests:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8080/v1/models \
|
||||
-H "Authorization: Bearer your-api-key"
|
||||
```
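
The same request from Python, reading the key from an environment variable rather than hard-coding it (the client-side variable name `LOCALAI_API_KEY` is just a convention used here):

```python
# A minimal sketch of an authenticated request. LOCALAI_API_KEY is a
# client-side convention, not something LocalAI itself reads.
import os
import requests

api_key = os.environ["LOCALAI_API_KEY"]

resp = requests.get(
    "http://localhost:8080/v1/models",
    headers={"Authorization": f"Bearer {api_key}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```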
|
||||
|
||||
**Important**: API keys provide full access to all LocalAI features (admin-level). Protect them accordingly.
|
||||
|
||||
## Network Security
|
||||
|
||||
### Never Expose Directly to Internet
|
||||
|
||||
**Always use a reverse proxy** when exposing LocalAI:
|
||||
|
||||
```nginx
|
||||
# nginx example
|
||||
server {
|
||||
listen 443 ssl;
|
||||
server_name localai.example.com;
|
||||
|
||||
ssl_certificate /path/to/cert.pem;
|
||||
ssl_certificate_key /path/to/key.pem;
|
||||
|
||||
location / {
|
||||
proxy_pass http://localhost:8080;
|
||||
proxy_set_header Host $host;
|
||||
proxy_set_header X-Real-IP $remote_addr;
|
||||
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
|
||||
proxy_set_header X-Forwarded-Proto $scheme;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Use HTTPS/TLS
|
||||
|
||||
**Always use HTTPS in production**:
|
||||
|
||||
1. Obtain SSL/TLS certificates (Let's Encrypt, etc.)
|
||||
2. Configure reverse proxy with TLS
|
||||
3. Enforce HTTPS redirects
|
||||
4. Use strong cipher suites
|
||||
|
||||
### Firewall Configuration
|
||||
|
||||
Restrict access with firewall rules:
|
||||
|
||||
```bash
|
||||
# Allow only specific IPs (example)
|
||||
ufw allow from 192.168.1.0/24 to any port 8080
|
||||
|
||||
# Or use iptables
|
||||
iptables -A INPUT -p tcp --dport 8080 -s 192.168.1.0/24 -j ACCEPT
|
||||
iptables -A INPUT -p tcp --dport 8080 -j DROP
|
||||
```
|
||||
|
||||
### VPN or Private Network
|
||||
|
||||
For sensitive deployments:
|
||||
- Use VPN for remote access
|
||||
- Deploy on private network only
|
||||
- Use network segmentation
|
||||
|
||||
## Model Security
|
||||
|
||||
### Model Source Verification
|
||||
|
||||
**Only use trusted model sources**:
|
||||
|
||||
1. **Official galleries**: Use LocalAI's model gallery
|
||||
2. **Verified repositories**: Hugging Face verified models
|
||||
3. **Verify checksums**: Check SHA256 hashes when provided
|
||||
4. **Scan for malware**: Scan downloaded files
|
||||
|
||||
### Model Isolation
|
||||
|
||||
- Run models in isolated environments
|
||||
- Use containers with limited permissions
|
||||
- Separate model storage from system
|
||||
|
||||
### Model Access Control
|
||||
|
||||
- Restrict file system access to models
|
||||
- Use appropriate file permissions
|
||||
- Consider read-only model storage
|
||||
|
||||
## Container Security
|
||||
|
||||
### Use Non-Root User
|
||||
|
||||
Run containers as non-root:
|
||||
|
||||
```yaml
|
||||
# Docker Compose
|
||||
services:
|
||||
localai:
|
||||
user: "1000:1000" # Non-root UID/GID
|
||||
```
|
||||
|
||||
### Limit Container Capabilities
|
||||
|
||||
```yaml
|
||||
services:
|
||||
localai:
|
||||
cap_drop:
|
||||
- ALL
|
||||
cap_add:
|
||||
- NET_BIND_SERVICE # Only what's needed
|
||||
```
|
||||
|
||||
### Resource Limits
|
||||
|
||||
Set resource limits to prevent resource exhaustion:
|
||||
|
||||
```yaml
|
||||
services:
|
||||
localai:
|
||||
deploy:
|
||||
resources:
|
||||
limits:
|
||||
cpus: '4'
|
||||
memory: 16G
|
||||
```
|
||||
|
||||
### Read-Only Filesystem
|
||||
|
||||
Where possible, use read-only filesystem:
|
||||
|
||||
```yaml
|
||||
services:
|
||||
localai:
|
||||
read_only: true
|
||||
tmpfs:
|
||||
- /tmp
|
||||
- /var/run
|
||||
```
|
||||
|
||||
## Input Validation
|
||||
|
||||
### Sanitize Inputs
|
||||
|
||||
Validate and sanitize all inputs:
|
||||
- Check input length limits
|
||||
- Validate data formats
|
||||
- Sanitize user prompts
|
||||
- Implement rate limiting
|
||||
|
||||
### File Upload Security
|
||||
|
||||
If accepting file uploads:
|
||||
- Validate file types
|
||||
- Limit file sizes
|
||||
- Scan for malware
|
||||
- Store in isolated location
|
||||
|
||||
## Logging and Monitoring
|
||||
|
||||
### Secure Logging
|
||||
|
||||
- Don't log sensitive data (API keys, user inputs)
|
||||
- Use secure log storage
|
||||
- Implement log rotation
|
||||
- Monitor for suspicious activity
|
||||
|
||||
### Monitoring
|
||||
|
||||
Monitor for:
|
||||
- Unusual API usage patterns
|
||||
- Failed authentication attempts
|
||||
- Resource exhaustion
|
||||
- Error rate spikes
|
||||
|
||||
## Updates and Maintenance
|
||||
|
||||
### Keep Updated
|
||||
|
||||
- Regularly update LocalAI
|
||||
- Update dependencies
|
||||
- Patch security vulnerabilities
|
||||
- Monitor security advisories
|
||||
|
||||
### Backup Security
|
||||
|
||||
- Encrypt backups
|
||||
- Secure backup storage
|
||||
- Test restore procedures
|
||||
- Limit backup access
|
||||
|
||||
## Deployment-Specific Security
|
||||
|
||||
### Kubernetes
|
||||
|
||||
- Use NetworkPolicies
|
||||
- Implement RBAC
|
||||
- Use Secrets for sensitive data
|
||||
- Enable Pod Security Policies
|
||||
- Use service mesh for mTLS
|
||||
|
||||
### Docker
|
||||
|
||||
- Use official images
|
||||
- Scan images for vulnerabilities
|
||||
- Keep images updated
|
||||
- Use Docker secrets
|
||||
- Implement health checks
|
||||
|
||||
### Systemd
|
||||
|
||||
- Run as dedicated user
|
||||
- Limit systemd service capabilities
|
||||
- Use PrivateTmp, ProtectSystem
|
||||
- Restrict network access
|
||||
|
||||
## Security Checklist
|
||||
|
||||
Before deploying to production:
|
||||
|
||||
- [ ] API keys configured and secured
|
||||
- [ ] HTTPS/TLS enabled
|
||||
- [ ] Reverse proxy configured
|
||||
- [ ] Firewall rules set
|
||||
- [ ] Network access restricted
|
||||
- [ ] Container security hardened
|
||||
- [ ] Resource limits configured
|
||||
- [ ] Logging configured securely
|
||||
- [ ] Monitoring in place
|
||||
- [ ] Updates planned
|
||||
- [ ] Backup security ensured
|
||||
- [ ] Incident response plan ready
|
||||
|
||||
## Incident Response
|
||||
|
||||
### If Compromised
|
||||
|
||||
1. **Isolate**: Immediately disconnect from network
|
||||
2. **Assess**: Determine scope of compromise
|
||||
3. **Contain**: Prevent further damage
|
||||
4. **Eradicate**: Remove threats
|
||||
5. **Recover**: Restore from clean backups
|
||||
6. **Learn**: Document and improve
|
||||
|
||||
### Security Contacts
|
||||
|
||||
- Report security issues: [GitHub Security](https://github.com/mudler/LocalAI/security)
|
||||
- Security discussions: [Discord](https://discord.gg/uJAeKSAGDy)
|
||||
|
||||
## Compliance Considerations
|
||||
|
||||
### Data Privacy
|
||||
|
||||
- Understand data processing
|
||||
- Implement data retention policies
|
||||
- Consider GDPR, CCPA requirements
|
||||
- Document data flows
|
||||
|
||||
### Audit Logging
|
||||
|
||||
- Log all API access
|
||||
- Track model usage
|
||||
- Monitor configuration changes
|
||||
- Retain logs appropriately
|
||||
|
||||
## See Also
|
||||
|
||||
- [Deploying to Production]({{% relref "docs/tutorials/deploying-production" %}}) - Production deployment
|
||||
- [API Reference]({{% relref "docs/reference/api-reference" %}}) - API security
|
||||
- [Troubleshooting]({{% relref "docs/troubleshooting" %}}) - Security issues
|
||||
- [FAQ]({{% relref "docs/faq" %}}) - Security questions
|
||||
|
||||
docs/content/docs/troubleshooting.md (new file, 392 lines)
@@ -0,0 +1,392 @@
|
||||
+++
|
||||
disableToc = false
|
||||
title = "Troubleshooting Guide"
|
||||
weight = 25
|
||||
icon = "bug_report"
|
||||
description = "Solutions to common problems and issues with LocalAI"
|
||||
+++
|
||||
|
||||
This guide helps you diagnose and fix common issues with LocalAI. If you can't find a solution here, check the [FAQ]({{% relref "docs/faq" %}}) or ask for help on [Discord](https://discord.gg/uJAeKSAGDy).
|
||||
|
||||
## Getting Help
|
||||
|
||||
Before asking for help, gather this information:
|
||||
|
||||
1. **LocalAI version**: `local-ai --version` or check container image tag
|
||||
2. **System information**: OS, CPU, RAM, GPU (if applicable)
|
||||
3. **Error messages**: Full error output with `DEBUG=true`
|
||||
4. **Configuration**: Relevant model configuration files
|
||||
5. **Logs**: Enable debug mode and capture logs
|
||||
|
||||
## Common Issues
|
||||
|
||||
### Model Not Loading
|
||||
|
||||
**Symptoms**: Model appears in list but fails to load or respond
|
||||
|
||||
**Solutions**:
|
||||
|
||||
1. **Check backend installation**:
|
||||
```bash
|
||||
local-ai backends list
|
||||
local-ai backends install <backend-name> # if missing
|
||||
```
|
||||
|
||||
2. **Verify model file**:
|
||||
- Check file exists and is not corrupted
|
||||
- Verify file format (GGUF recommended)
|
||||
- Re-download if corrupted
|
||||
|
||||
3. **Check memory**:
|
||||
- Ensure sufficient RAM available
|
||||
- Try smaller quantization (Q4_K_S instead of Q8_0)
|
||||
- Reduce `context_size` in configuration
|
||||
|
||||
4. **Check logs**:
|
||||
```bash
|
||||
DEBUG=true local-ai
|
||||
```
|
||||
Look for specific error messages
|
||||
|
||||
5. **Verify backend compatibility**:
|
||||
- Check [Compatibility Table]({{% relref "docs/reference/compatibility-table" %}})
|
||||
- Ensure correct backend specified in model config
|
||||
|
||||
### Out of Memory Errors
|
||||
|
||||
**Symptoms**: Errors about memory, crashes, or very slow performance
|
||||
|
||||
**Solutions**:
|
||||
|
||||
1. **Reduce model size**:
|
||||
- Use smaller quantization (Q2_K, Q4_K_S)
|
||||
- Use smaller models (1-3B instead of 7B+)
|
||||
|
||||
2. **Adjust configuration**:
|
||||
```yaml
|
||||
context_size: 1024 # Reduce from default
|
||||
gpu_layers: 20 # Reduce GPU layers if using GPU
|
||||
```
|
||||
|
||||
3. **Free system memory**:
|
||||
- Close other applications
|
||||
- Reduce number of loaded models
|
||||
- Use `--single-active-backend` flag
|
||||
|
||||
4. **Check system limits**:
|
||||
```bash
|
||||
# Linux
|
||||
free -h
|
||||
ulimit -a
|
||||
```
|
||||
|
||||
### Slow Performance
|
||||
|
||||
**Symptoms**: Very slow responses, low tokens/second
|
||||
|
||||
**Solutions**:
|
||||
|
||||
1. **Check hardware**:
|
||||
- Use SSD instead of HDD for model storage
|
||||
- Ensure adequate CPU cores
|
||||
- Enable GPU acceleration if available
|
||||
|
||||
2. **Optimize configuration**:
|
||||
```yaml
|
||||
threads: 4 # Match CPU cores
|
||||
gpu_layers: 35 # Offload to GPU if available
|
||||
mmap: true # Enable memory mapping
|
||||
```
|
||||
|
||||
3. **Check for bottlenecks**:
|
||||
```bash
|
||||
# Monitor CPU
|
||||
top
|
||||
|
||||
# Monitor GPU (NVIDIA)
|
||||
nvidia-smi
|
||||
|
||||
# Monitor disk I/O
|
||||
iostat
|
||||
```
|
||||
|
||||
4. **Disable unnecessary features**:
|
||||
- Set `mirostat: 0` if not needed
|
||||
- Reduce context size
|
||||
- Use smaller models
|
||||
|
||||
5. **Check network**: If using remote models, check network latency
|
||||
|
||||
### GPU Not Working
|
||||
|
||||
**Symptoms**: GPU not detected, no GPU usage, or CUDA errors
|
||||
|
||||
**Solutions**:
|
||||
|
||||
1. **Verify GPU drivers**:
|
||||
```bash
|
||||
# NVIDIA
|
||||
nvidia-smi
|
||||
|
||||
# AMD
|
||||
rocm-smi
|
||||
```
|
||||
|
||||
2. **Check Docker GPU access**:
|
||||
```bash
|
||||
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
|
||||
```
|
||||
|
||||
3. **Use correct image**:
|
||||
- NVIDIA: `localai/localai:latest-gpu-nvidia-cuda-12`
|
||||
- AMD: `localai/localai:latest-gpu-hipblas`
|
||||
- Intel: `localai/localai:latest-gpu-intel`
|
||||
|
||||
4. **Configure GPU layers**:
|
||||
```yaml
|
||||
gpu_layers: 35 # Adjust based on GPU memory
|
||||
f16: true
|
||||
```
|
||||
|
||||
5. **Check CUDA version**: Ensure CUDA version matches (11.7 vs 12.0)
|
||||
|
||||
6. **Check logs**: Enable debug mode to see GPU initialization messages
|
||||
|
||||
### API Errors
|
||||
|
||||
**Symptoms**: 400, 404, 500, or 503 errors from API
|
||||
|
||||
**Solutions**:
|
||||
|
||||
1. **404 - Model Not Found**:
|
||||
- Verify model name is correct
|
||||
- Check model is installed: `curl http://localhost:8080/v1/models`
|
||||
- Ensure model file exists in models directory
|
||||
|
||||
2. **503 - Service Unavailable**:
|
||||
- Model may not be loaded yet (wait a moment)
|
||||
- Check if model failed to load (check logs)
|
||||
- Verify backend is installed
|
||||
|
||||
3. **400 - Bad Request**:
|
||||
- Check request format matches API specification
|
||||
- Verify all required parameters are present
|
||||
- Check parameter types and values
|
||||
|
||||
4. **500 - Internal Server Error**:
|
||||
- Enable debug mode: `DEBUG=true`
|
||||
- Check logs for specific error
|
||||
- Verify model configuration is valid
|
||||
|
||||
5. **401 - Unauthorized**:
|
||||
- Check if API key is required
|
||||
- Verify API key is correct
|
||||
- Include Authorization header if needed
|
||||
|
||||
### Installation Issues
|
||||
|
||||
**Symptoms**: Installation fails or LocalAI won't start
|
||||
|
||||
**Solutions**:
|
||||
|
||||
1. **Docker issues**:
|
||||
```bash
|
||||
# Check Docker is running
|
||||
docker ps
|
||||
|
||||
# Check image exists
|
||||
docker images | grep localai
|
||||
|
||||
# Pull latest image
|
||||
docker pull localai/localai:latest
|
||||
```
|
||||
|
||||
2. **Permission issues**:
|
||||
```bash
|
||||
# Check file permissions
|
||||
ls -la models/
|
||||
|
||||
# Fix permissions if needed
|
||||
chmod -R 755 models/
|
||||
```
|
||||
|
||||
3. **Port already in use**:
|
||||
```bash
|
||||
# Find process using port
|
||||
lsof -i :8080
|
||||
|
||||
# Use different port
|
||||
docker run -p 8081:8080 ...
|
||||
```
|
||||
|
||||
4. **Binary not found**:
|
||||
- Verify binary is in PATH
|
||||
- Check binary has execute permissions
|
||||
- Reinstall if needed
|
||||
|
||||
### Backend Issues
|
||||
|
||||
**Symptoms**: Backend fails to install or load
|
||||
|
||||
**Solutions**:
|
||||
|
||||
1. **Check backend availability**:
|
||||
```bash
|
||||
local-ai backends list
|
||||
```
|
||||
|
||||
2. **Manual installation**:
|
||||
```bash
|
||||
local-ai backends install <backend-name>
|
||||
```
|
||||
|
||||
3. **Check network**: Backend download requires internet connection
|
||||
|
||||
4. **Check disk space**: Ensure sufficient space for backend files
|
||||
|
||||
5. **Rebuild if needed**:
|
||||
```bash
|
||||
REBUILD=true local-ai
|
||||
```
|
||||
|
||||
### Configuration Issues
|
||||
|
||||
**Symptoms**: Models not working as expected, wrong behavior
|
||||
|
||||
**Solutions**:
|
||||
|
||||
1. **Validate YAML syntax**:
|
||||
```bash
|
||||
# Check YAML is valid
|
||||
yamllint model.yaml
|
||||
```
|
||||
|
||||
2. **Check configuration reference**:
|
||||
- See [Model Configuration]({{% relref "docs/advanced/model-configuration" %}})
|
||||
- Verify all parameters are correct
|
||||
|
||||
3. **Test with minimal config**:
|
||||
- Start with a basic configuration (see the minimal sketch after this list)
|
||||
- Add parameters one at a time
|
||||
|
||||
4. **Check template files**:
|
||||
- Verify template syntax
|
||||
- Check template matches model type
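For the minimal-config test mentioned in step 3, something like the sketch below is enough to start from; the model file name is only a placeholder for a GGUF file already present in your models directory:

```bash
# Write a deliberately minimal model configuration, then add options back one at a time
cat > models/minimal.yaml <<'EOF'
name: minimal
parameters:
  model: your-model.Q4_K_M.gguf
EOF
```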
|
||||
|
||||
## Debugging Tips
|
||||
|
||||
### Enable Debug Mode
|
||||
|
||||
```bash
|
||||
# Environment variable
|
||||
DEBUG=true local-ai
|
||||
|
||||
# Command line flag
|
||||
local-ai --debug
|
||||
|
||||
# Docker
|
||||
docker run -e DEBUG=true ...
|
||||
```
|
||||
|
||||
### Check Logs
|
||||
|
||||
```bash
|
||||
# Docker logs
|
||||
docker logs local-ai
|
||||
|
||||
# Systemd logs
|
||||
journalctl -u localai -f
|
||||
|
||||
# Direct output
|
||||
local-ai 2>&1 | tee localai.log
|
||||
```
|
||||
|
||||
### Test API Endpoints
|
||||
|
||||
```bash
|
||||
# Health check
|
||||
curl http://localhost:8080/healthz
|
||||
|
||||
# Readiness check
|
||||
curl http://localhost:8080/readyz
|
||||
|
||||
# List models
|
||||
curl http://localhost:8080/v1/models
|
||||
|
||||
# Test chat
|
||||
curl http://localhost:8080/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"model": "gpt-4", "messages": [{"role": "user", "content": "test"}]}'
|
||||
```
|
||||
|
||||
### Monitor Resources
|
||||
|
||||
```bash
|
||||
# CPU and memory
|
||||
htop
|
||||
|
||||
# GPU (NVIDIA)
|
||||
watch -n 1 nvidia-smi
|
||||
|
||||
# Disk usage
|
||||
df -h
|
||||
du -sh models/
|
||||
|
||||
# Network
|
||||
iftop
|
||||
```
|
||||
|
||||
## Performance Issues
|
||||
|
||||
### Slow Inference
|
||||
|
||||
1. **Check token speed**: Look for tokens/second in debug logs (see the snippet after this list)
|
||||
2. **Optimize threads**: Match CPU cores
|
||||
3. **Enable GPU**: Use GPU acceleration
|
||||
4. **Reduce context**: Smaller context = faster inference
|
||||
5. **Use quantization**: Q4_K_M offers a good balance of speed and quality
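For step 1, a quick way to pull the token-rate lines out of the debug output; the exact log wording differs between backends, so treat the grep pattern below as an assumption to adjust:

```bash
# Run with debug logging and filter for token-rate lines
DEBUG=true local-ai 2>&1 | tee localai.log | grep -i "tok"
```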
|
||||
|
||||
### High Memory Usage
|
||||
|
||||
1. **Use smaller models**: 1-3B instead of 7B+
|
||||
2. **Lower quantization**: Q2_K uses less memory
|
||||
3. **Reduce context size**: Smaller context = less memory
|
||||
4. **Disable mmap**: Set `mmap: false` (slower but uses less memory)
|
||||
5. **Unload unused models**: Only load models you're using
|
||||
|
||||
## Platform-Specific Issues
|
||||
|
||||
### macOS
|
||||
|
||||
- **Quarantine warnings**: See the [FAQ]({{% relref "docs/faq" %}})
- **Metal not working**: Ensure Xcode or the Command Line Tools are installed
- **Docker performance**: Docker adds overhead on macOS; consider building from source for best results
|
||||
|
||||
### Linux
|
||||
|
||||
- **Permission denied**: Check file permissions and SELinux
|
||||
- **Missing libraries**: Install required system libraries
|
||||
- **Systemd issues**: Check service status and logs
|
||||
|
||||
### Windows/WSL
|
||||
|
||||
- **Slow model loading**: Keep model files on the WSL (Linux) filesystem rather than under `/mnt/c`, which is much slower
- **GPU access**: Requires WSL2 with GPU support and up-to-date Windows GPU drivers
- **Path issues**: Use forward slashes in paths
|
||||
|
||||
## Getting More Help
|
||||
|
||||
If you've tried the solutions above and still have issues:
|
||||
|
||||
1. **Check GitHub Issues**: Search [GitHub Issues](https://github.com/mudler/LocalAI/issues)
|
||||
2. **Ask on Discord**: Join [Discord](https://discord.gg/uJAeKSAGDy)
|
||||
3. **Create an Issue**: Provide all debugging information
|
||||
4. **Check Documentation**: Review relevant documentation sections
|
||||
|
||||
## See Also
|
||||
|
||||
- [FAQ]({{% relref "docs/faq" %}}) - Common questions
|
||||
- [Performance Tuning]({{% relref "docs/advanced/performance-tuning" %}}) - Optimize performance
|
||||
- [VRAM Management]({{% relref "docs/advanced/vram-management" %}}) - GPU memory management
|
||||
- [Model Configuration]({{% relref "docs/advanced/model-configuration" %}}) - Configuration reference
|
||||
|
||||
34
docs/content/docs/tutorials/_index.en.md
Normal file
@@ -0,0 +1,34 @@
|
||||
+++
|
||||
disableToc = false
|
||||
title = "Tutorials"
|
||||
weight = 5
|
||||
icon = "school"
|
||||
description = "Step-by-step guides to help you get started with LocalAI"
|
||||
+++
|
||||
|
||||
Welcome to the LocalAI tutorials section! These step-by-step guides will help you learn how to use LocalAI effectively, from your first chat to deploying in production.
|
||||
|
||||
## Getting Started Tutorials
|
||||
|
||||
Start here if you're new to LocalAI:
|
||||
|
||||
1. **[Your First Chat]({{% relref "docs/tutorials/first-chat" %}})** - Learn how to install LocalAI and have your first conversation with an AI model
|
||||
2. **[Setting Up Models]({{% relref "docs/tutorials/setting-up-models" %}})** - A comprehensive guide to installing and configuring models
|
||||
3. **[Using GPU Acceleration]({{% relref "docs/tutorials/using-gpu" %}})** - Set up GPU support for faster inference
|
||||
|
||||
## Advanced Tutorials
|
||||
|
||||
Ready to take it further?
|
||||
|
||||
4. **[Deploying to Production]({{% relref "docs/tutorials/deploying-production" %}})** - Best practices for running LocalAI in production environments
|
||||
5. **[Integration Examples]({{% relref "docs/tutorials/integration-examples" %}})** - Learn how to integrate LocalAI with popular frameworks and tools
|
||||
|
||||
## What's Next?
|
||||
|
||||
After completing the tutorials, explore:
|
||||
|
||||
- [Features Documentation]({{% relref "docs/features" %}}) - Detailed information about all LocalAI capabilities
|
||||
- [Advanced Configuration]({{% relref "docs/advanced" %}}) - Fine-tune your setup
|
||||
- [API Reference]({{% relref "docs/reference/api-reference" %}}) - Complete API documentation
|
||||
- [Troubleshooting Guide]({{% relref "docs/troubleshooting" %}}) - Solutions to common problems
|
||||
|
||||
355
docs/content/docs/tutorials/deploying-production.md
Normal file
@@ -0,0 +1,355 @@
|
||||
+++
|
||||
disableToc = false
|
||||
title = "Deploying to Production"
|
||||
weight = 4
|
||||
icon = "rocket_launch"
|
||||
description = "Best practices for running LocalAI in production environments"
|
||||
+++
|
||||
|
||||
This tutorial covers best practices for deploying LocalAI in production environments, including security, performance, monitoring, and reliability considerations.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- LocalAI installed and tested
|
||||
- Understanding of your deployment environment
|
||||
- Basic knowledge of Docker, Kubernetes, or your chosen deployment method
|
||||
|
||||
## Security Considerations
|
||||
|
||||
### 1. API Key Protection
|
||||
|
||||
**Always use API keys in production**:
|
||||
|
||||
```bash
|
||||
# Set API key
|
||||
API_KEY=your-secure-random-key local-ai
|
||||
|
||||
# Or multiple keys
|
||||
API_KEY=key1,key2,key3 local-ai
|
||||
```
|
||||
|
||||
**Best Practices**:
|
||||
- Use strong, randomly generated keys (see the example below)
|
||||
- Store keys securely (environment variables, secrets management)
|
||||
- Rotate keys regularly
|
||||
- Use different keys for different services/clients
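For example, a high-entropy key can be generated with OpenSSL and passed straight to LocalAI:

```bash
# Generate a 64-character hex key and start LocalAI with it
export API_KEY="$(openssl rand -hex 32)"
echo "$API_KEY"   # store this in your secrets manager
local-ai
```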
|
||||
|
||||
### 2. Network Security
|
||||
|
||||
**Never expose LocalAI directly to the internet** without protection:
|
||||
|
||||
- Use a reverse proxy (nginx, Traefik, Caddy)
|
||||
- Enable HTTPS/TLS
|
||||
- Use firewall rules to restrict access
|
||||
- Consider VPN or private network access only
|
||||
|
||||
**Example nginx configuration**:
|
||||
|
||||
```nginx
|
||||
server {
|
||||
listen 443 ssl;
|
||||
server_name localai.example.com;
|
||||
|
||||
ssl_certificate /path/to/cert.pem;
|
||||
ssl_certificate_key /path/to/key.pem;
|
||||
|
||||
location / {
|
||||
proxy_pass http://localhost:8080;
|
||||
proxy_set_header Host $host;
|
||||
proxy_set_header X-Real-IP $remote_addr;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 3. Resource Limits
|
||||
|
||||
Set appropriate resource limits to prevent resource exhaustion:
|
||||
|
||||
```yaml
|
||||
# Docker Compose example
|
||||
services:
|
||||
localai:
|
||||
deploy:
|
||||
resources:
|
||||
limits:
|
||||
cpus: '4'
|
||||
memory: 16G
|
||||
reservations:
|
||||
cpus: '2'
|
||||
memory: 8G
|
||||
```
|
||||
|
||||
## Deployment Methods
|
||||
|
||||
### Docker Compose (Recommended for Small-Medium Deployments)
|
||||
|
||||
```yaml
|
||||
version: '3.8'
|
||||
|
||||
services:
|
||||
localai:
|
||||
image: localai/localai:latest
|
||||
ports:
|
||||
- "8080:8080"
|
||||
environment:
|
||||
- API_KEY=${API_KEY}
|
||||
- DEBUG=false
|
||||
- MODELS_PATH=/models
|
||||
volumes:
|
||||
- ./models:/models
|
||||
- ./config:/config
|
||||
restart: unless-stopped
|
||||
healthcheck:
|
||||
test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
retries: 3
|
||||
deploy:
|
||||
resources:
|
||||
limits:
|
||||
memory: 16G
|
||||
```
|
||||
|
||||
### Kubernetes
|
||||
|
||||
See the [Kubernetes Deployment Guide]({{% relref "docs/getting-started/kubernetes" %}}) for detailed instructions.
|
||||
|
||||
**Key considerations**:
|
||||
- Use ConfigMaps for configuration
|
||||
- Use Secrets for API keys
|
||||
- Set resource requests and limits
|
||||
- Configure health checks and liveness probes
|
||||
- Use PersistentVolumes for model storage
|
||||
|
||||
### Systemd Service (Linux)
|
||||
|
||||
Create a systemd service file:
|
||||
|
||||
```ini
|
||||
[Unit]
|
||||
Description=LocalAI Service
|
||||
After=network.target
|
||||
|
||||
[Service]
|
||||
Type=simple
|
||||
User=localai
|
||||
Environment="API_KEY=your-key"
|
||||
Environment="MODELS_PATH=/var/lib/localai/models"
|
||||
ExecStart=/usr/local/bin/local-ai
|
||||
Restart=always
|
||||
RestartSec=10
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
```
|
||||
|
||||
## Performance Optimization
|
||||
|
||||
### 1. Model Selection
|
||||
|
||||
- Use quantized models (Q4_K_M) for production
|
||||
- Choose models appropriate for your hardware
|
||||
- Consider model size vs. quality trade-offs
|
||||
|
||||
### 2. Resource Allocation
|
||||
|
||||
```yaml
|
||||
# Model configuration
|
||||
name: production-model
|
||||
parameters:
|
||||
model: model.gguf
|
||||
context_size: 2048 # Adjust based on needs
|
||||
threads: 4 # Match CPU cores
|
||||
gpu_layers: 35 # If using GPU
|
||||
```
|
||||
|
||||
### 3. Caching
|
||||
|
||||
Enable prompt caching for repeated queries:
|
||||
|
||||
```yaml
|
||||
prompt_cache_path: "cache"
|
||||
prompt_cache_all: true
|
||||
```
|
||||
|
||||
### 4. Connection Pooling
|
||||
|
||||
If using a reverse proxy, configure connection pooling:
|
||||
|
||||
```nginx
|
||||
upstream localai {
|
||||
least_conn;
|
||||
server localhost:8080 max_fails=3 fail_timeout=30s;
|
||||
keepalive 32;
|
||||
}
|
||||
```
|
||||
|
||||
## Monitoring and Logging
|
||||
|
||||
### 1. Health Checks
|
||||
|
||||
LocalAI provides health check endpoints:
|
||||
|
||||
```bash
|
||||
# Readiness check
|
||||
curl http://localhost:8080/readyz
|
||||
|
||||
# Health check
|
||||
curl http://localhost:8080/healthz
|
||||
```
|
||||
|
||||
### 2. Logging
|
||||
|
||||
Configure appropriate log levels:
|
||||
|
||||
```bash
|
||||
# Production: minimal logging
|
||||
DEBUG=false local-ai
|
||||
|
||||
# Development: detailed logging
|
||||
DEBUG=true local-ai
|
||||
```
|
||||
|
||||
### 3. Metrics
|
||||
|
||||
Monitor key metrics:
|
||||
- Request rate
|
||||
- Response times
|
||||
- Error rates
|
||||
- Resource usage (CPU, memory, GPU)
|
||||
- Model loading times
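A lightweight way to sample response time without extra tooling is curl's built-in timing variables; the model name is just whichever model you have installed:

```bash
# Measure end-to-end latency of a single chat request
curl -s -o /dev/null \
  -w 'HTTP %{http_code} total=%{time_total}s\n' \
  http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "ping"}]}'
```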
|
||||
|
||||
### 4. Alerting
|
||||
|
||||
Set up alerts for:
|
||||
- Service downtime
|
||||
- High error rates
|
||||
- Resource exhaustion
|
||||
- Slow response times
|
||||
|
||||
## High Availability
|
||||
|
||||
### 1. Multiple Instances
|
||||
|
||||
Run multiple LocalAI instances behind a load balancer:
|
||||
|
||||
```yaml
|
||||
# Docker Compose with multiple instances
|
||||
services:
|
||||
localai1:
|
||||
image: localai/localai:latest
|
||||
# ... configuration
|
||||
|
||||
localai2:
|
||||
image: localai/localai:latest
|
||||
# ... configuration
|
||||
|
||||
nginx:
|
||||
image: nginx:alpine
|
||||
# Load balance between localai1 and localai2
|
||||
```
|
||||
|
||||
### 2. Model Replication
|
||||
|
||||
Ensure models are available on all instances:
|
||||
- Shared storage (NFS, S3, etc.)
|
||||
- Model synchronization
|
||||
- Consistent model versions
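If shared storage is not an option, a simple approach is to push the models directory from a primary node to the others; the host and path below are placeholders:

```bash
# Sync models from the primary node to a second instance
rsync -av --delete ./models/ user@localai-2.internal:/srv/localai/models/
```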
|
||||
|
||||
### 3. Graceful Shutdown
|
||||
|
||||
LocalAI supports graceful shutdown. Ensure your deployment method handles SIGTERM properly.
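With Docker, for example, `docker stop` sends SIGTERM first and only force-kills after the timeout, so a generous grace period lets in-flight requests finish:

```bash
# Allow up to 60 seconds for a clean shutdown before SIGKILL
docker stop --time 60 local-ai
```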
|
||||
|
||||
## Backup and Recovery
|
||||
|
||||
### 1. Model Backups
|
||||
|
||||
Regularly backup your models and configurations:
|
||||
|
||||
```bash
|
||||
# Backup models
|
||||
tar -czf models-backup-$(date +%Y%m%d).tar.gz models/
|
||||
|
||||
# Backup configurations
|
||||
tar -czf config-backup-$(date +%Y%m%d).tar.gz config/
|
||||
```
|
||||
|
||||
### 2. Configuration Management
|
||||
|
||||
Version control your configurations:
|
||||
- Use Git for YAML configurations
|
||||
- Document model versions
|
||||
- Track configuration changes
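A minimal way to start tracking changes, assuming the YAML files live in a `config/` directory next to your models:

```bash
# Put model configurations under version control
cd config
git init
git add ./*.yaml
git commit -m "Initial LocalAI model configurations"
```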
|
||||
|
||||
### 3. Disaster Recovery
|
||||
|
||||
Plan for:
|
||||
- Model storage recovery
|
||||
- Configuration restoration
|
||||
- Service restoration procedures
|
||||
|
||||
## Scaling Considerations
|
||||
|
||||
### Horizontal Scaling
|
||||
|
||||
- Run multiple instances
|
||||
- Use load balancing
|
||||
- Consider stateless design (shared model storage)
|
||||
|
||||
### Vertical Scaling
|
||||
|
||||
- Increase resources (CPU, RAM, GPU)
|
||||
- Use more powerful hardware
|
||||
- Optimize model configurations
|
||||
|
||||
## Maintenance
|
||||
|
||||
### 1. Updates
|
||||
|
||||
- Test updates in staging first
|
||||
- Plan maintenance windows
|
||||
- Have rollback procedures ready
|
||||
|
||||
### 2. Model Updates
|
||||
|
||||
- Test new models before production
|
||||
- Keep model versions documented
|
||||
- Have rollback capability
|
||||
|
||||
### 3. Monitoring
|
||||
|
||||
Regularly review:
|
||||
- Performance metrics
|
||||
- Error logs
|
||||
- Resource usage trends
|
||||
- User feedback
|
||||
|
||||
## Production Checklist
|
||||
|
||||
Before going live, ensure:
|
||||
|
||||
- [ ] API keys configured and secured
|
||||
- [ ] HTTPS/TLS enabled
|
||||
- [ ] Firewall rules configured
|
||||
- [ ] Resource limits set
|
||||
- [ ] Health checks configured
|
||||
- [ ] Monitoring in place
|
||||
- [ ] Logging configured
|
||||
- [ ] Backups scheduled
|
||||
- [ ] Documentation updated
|
||||
- [ ] Team trained on operations
|
||||
- [ ] Incident response plan ready
|
||||
|
||||
## What's Next?
|
||||
|
||||
- [Kubernetes Deployment]({{% relref "docs/getting-started/kubernetes" %}}) - Deploy on Kubernetes
|
||||
- [Performance Tuning]({{% relref "docs/advanced/performance-tuning" %}}) - Optimize performance
|
||||
- [Security Best Practices]({{% relref "docs/security" %}}) - Security guidelines
|
||||
- [Troubleshooting Guide]({{% relref "docs/troubleshooting" %}}) - Production issues
|
||||
|
||||
## See Also
|
||||
|
||||
- [Container Images]({{% relref "docs/getting-started/container-images" %}})
|
||||
- [Advanced Configuration]({{% relref "docs/advanced" %}})
|
||||
- [FAQ]({{% relref "docs/faq" %}})
|
||||
|
||||
171
docs/content/docs/tutorials/first-chat.md
Normal file
@@ -0,0 +1,171 @@
|
||||
+++
|
||||
disableToc = false
|
||||
title = "Your First Chat with LocalAI"
|
||||
weight = 1
|
||||
icon = "chat"
|
||||
description = "Get LocalAI running and have your first conversation in minutes"
|
||||
+++
|
||||
|
||||
This tutorial will guide you through installing LocalAI and having your first conversation with an AI model. By the end, you'll have LocalAI running and be able to chat with a local AI model.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- A computer running Linux, macOS, or Windows (with WSL)
|
||||
- At least 4GB of RAM (8GB+ recommended)
|
||||
- Docker installed (optional, but recommended for easiest setup)
|
||||
|
||||
## Step 1: Install LocalAI
|
||||
|
||||
Choose the installation method that works best for you:
|
||||
|
||||
### Option A: Docker (Recommended for Beginners)
|
||||
|
||||
```bash
|
||||
# Run LocalAI with AIO (All-in-One) image - includes pre-configured models
|
||||
docker run -p 8080:8080 --name local-ai -ti localai/localai:latest-aio-cpu
|
||||
```
|
||||
|
||||
This will:
|
||||
- Download the LocalAI image
|
||||
- Start the API server on port 8080
|
||||
- Automatically download and configure models
|
||||
|
||||
### Option B: Quick Install Script (Linux)
|
||||
|
||||
```bash
|
||||
curl https://localai.io/install.sh | sh
|
||||
```
|
||||
|
||||
### Option C: macOS DMG
|
||||
|
||||
Download the DMG from [GitHub Releases](https://github.com/mudler/LocalAI/releases/latest/download/LocalAI.dmg) and install it.
|
||||
|
||||
For more installation options, see the [Quickstart Guide]({{% relref "docs/getting-started/quickstart" %}}).
|
||||
|
||||
## Step 2: Verify Installation
|
||||
|
||||
Once LocalAI is running, verify it's working:
|
||||
|
||||
```bash
|
||||
# Check if the API is responding
|
||||
curl http://localhost:8080/v1/models
|
||||
```
|
||||
|
||||
You should see a JSON response listing available models. If using the AIO image, you'll see models like `gpt-4`, `gpt-4-vision-preview`, etc.
|
||||
|
||||
## Step 3: Access the WebUI
|
||||
|
||||
Open your web browser and navigate to:
|
||||
|
||||
```
|
||||
http://localhost:8080
|
||||
```
|
||||
|
||||
You'll see the LocalAI WebUI with:
|
||||
- A chat interface
|
||||
- Model gallery
|
||||
- Backend management
|
||||
- Configuration options
|
||||
|
||||
## Step 4: Your First Chat
|
||||
|
||||
### Using the WebUI
|
||||
|
||||
1. In the WebUI, you'll see a chat interface
|
||||
2. Select a model from the dropdown (if multiple models are available)
|
||||
3. Type your message and press Enter
|
||||
4. Wait for the AI to respond!
|
||||
|
||||
### Using the API (Command Line)
|
||||
|
||||
You can also chat using curl:
|
||||
|
||||
```bash
|
||||
curl http://localhost:8080/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "gpt-4",
|
||||
"messages": [
|
||||
{"role": "user", "content": "Hello! Can you introduce yourself?"}
|
||||
],
|
||||
"temperature": 0.7
|
||||
}'
|
||||
```
|
||||
|
||||
### Using Python
|
||||
|
||||
```python
|
||||
import requests
|
||||
|
||||
response = requests.post(
|
||||
"http://localhost:8080/v1/chat/completions",
|
||||
json={
|
||||
"model": "gpt-4",
|
||||
"messages": [
|
||||
{"role": "user", "content": "Hello! Can you introduce yourself?"}
|
||||
],
|
||||
"temperature": 0.7
|
||||
}
|
||||
)
|
||||
|
||||
print(response.json()["choices"][0]["message"]["content"])
|
||||
```
|
||||
|
||||
## Step 5: Try Different Models
|
||||
|
||||
If you're using the AIO image, you have several models pre-installed:
|
||||
|
||||
- `gpt-4` - Text generation
|
||||
- `gpt-4-vision-preview` - Vision and text
|
||||
- `tts-1` - Text to speech
|
||||
- `whisper-1` - Speech to text
|
||||
|
||||
Try asking the vision model about an image, or generate speech with the TTS model!
|
||||
|
||||
### Installing New Models via WebUI
|
||||
|
||||
To install additional models, you can use the WebUI's import interface:
|
||||
|
||||
1. In the WebUI, navigate to the "Models" tab
|
||||
2. Click "Import Model" or "New Model"
|
||||
3. Enter a model URI (e.g., `huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf`)
|
||||
4. Configure preferences or use Advanced Mode for YAML editing
|
||||
5. Click "Import Model" to start the installation
|
||||
|
||||
For more details, see [Setting Up Models]({{% relref "docs/tutorials/setting-up-models" %}}).
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Port 8080 is already in use
|
||||
|
||||
Change the port mapping:
|
||||
```bash
|
||||
docker run -p 8081:8080 --name local-ai -ti localai/localai:latest-aio-cpu
|
||||
```
|
||||
Then access at `http://localhost:8081`
|
||||
|
||||
### No models available
|
||||
|
||||
If you're using a standard (non-AIO) image, you need to install models. See [Setting Up Models]({{% relref "docs/tutorials/setting-up-models" %}}) tutorial.
|
||||
|
||||
### Slow responses
|
||||
|
||||
- Check if you have enough RAM
|
||||
- Consider using a smaller model
|
||||
- Enable GPU acceleration (see [Using GPU]({{% relref "docs/tutorials/using-gpu" %}}))
|
||||
|
||||
## What's Next?
|
||||
|
||||
Congratulations! You've successfully set up LocalAI and had your first chat. Here's what to explore next:
|
||||
|
||||
1. **[Setting Up Models]({{% relref "docs/tutorials/setting-up-models" %}})** - Learn how to install and configure different models
|
||||
2. **[Using GPU Acceleration]({{% relref "docs/tutorials/using-gpu" %}})** - Speed up inference with GPU support
|
||||
3. **[Try It Out]({{% relref "docs/getting-started/try-it-out" %}})** - Explore more API endpoints and features
|
||||
4. **[Features Documentation]({{% relref "docs/features" %}})** - Discover all LocalAI capabilities
|
||||
|
||||
## See Also
|
||||
|
||||
- [Quickstart Guide]({{% relref "docs/getting-started/quickstart" %}})
|
||||
- [FAQ]({{% relref "docs/faq" %}})
|
||||
- [Troubleshooting Guide]({{% relref "docs/troubleshooting" %}})
|
||||
|
||||
361
docs/content/docs/tutorials/integration-examples.md
Normal file
@@ -0,0 +1,361 @@
|
||||
+++
|
||||
disableToc = false
|
||||
title = "Integration Examples"
|
||||
weight = 5
|
||||
icon = "sync"
|
||||
description = "Learn how to integrate LocalAI with popular frameworks and tools"
|
||||
+++
|
||||
|
||||
This tutorial shows you how to integrate LocalAI with popular AI frameworks and tools. LocalAI's OpenAI-compatible API makes it easy to use as a drop-in replacement.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- LocalAI running and accessible
|
||||
- Basic knowledge of the framework you want to integrate
|
||||
- Python, Node.js, or other runtime as needed
|
||||
|
||||
## Python Integrations
|
||||
|
||||
### LangChain
|
||||
|
||||
LangChain has built-in support for LocalAI:
|
||||
|
||||
```python
|
||||
from langchain.chat_models import ChatOpenAI
|
||||
|
||||
# For chat models
|
||||
llm = ChatOpenAI(
|
||||
openai_api_key="not-needed",
|
||||
openai_api_base="http://localhost:8080/v1",
|
||||
model_name="gpt-4"
|
||||
)
|
||||
|
||||
response = llm.predict("Hello, how are you?")
|
||||
print(response)
|
||||
```
|
||||
|
||||
### OpenAI Python SDK
|
||||
|
||||
The official OpenAI Python SDK works directly with LocalAI:
|
||||
|
||||
```python
|
||||
import openai
|
||||
|
||||
openai.api_base = "http://localhost:8080/v1"
|
||||
openai.api_key = "not-needed"
|
||||
|
||||
response = openai.ChatCompletion.create(
|
||||
model="gpt-4",
|
||||
messages=[
|
||||
{"role": "user", "content": "Hello!"}
|
||||
]
|
||||
)
|
||||
|
||||
print(response.choices[0].message.content)
|
||||
```
|
||||
|
||||
### LangChain with LocalAI Functions
|
||||
|
||||
```python
|
||||
from langchain.agents import initialize_agent, Tool
|
||||
from langchain.llms import OpenAI
|
||||
|
||||
llm = OpenAI(
|
||||
openai_api_key="not-needed",
|
||||
openai_api_base="http://localhost:8080/v1"
|
||||
)
|
||||
|
||||
tools = [
|
||||
Tool(
|
||||
name="Calculator",
|
||||
func=lambda x: eval(x),
|
||||
description="Useful for mathematical calculations"
|
||||
)
|
||||
]
|
||||
|
||||
agent = initialize_agent(tools, llm, agent="zero-shot-react-description")
|
||||
result = agent.run("What is 25 * 4?")
|
||||
```
|
||||
|
||||
## JavaScript/TypeScript Integrations
|
||||
|
||||
### OpenAI Node.js SDK
|
||||
|
||||
```javascript
|
||||
import OpenAI from 'openai';
|
||||
|
||||
const openai = new OpenAI({
|
||||
baseURL: 'http://localhost:8080/v1',
|
||||
apiKey: 'not-needed',
|
||||
});
|
||||
|
||||
async function main() {
|
||||
const completion = await openai.chat.completions.create({
|
||||
model: 'gpt-4',
|
||||
messages: [{ role: 'user', content: 'Hello!' }],
|
||||
});
|
||||
|
||||
console.log(completion.choices[0].message.content);
|
||||
}
|
||||
|
||||
main();
|
||||
```
|
||||
|
||||
### LangChain.js
|
||||
|
||||
```javascript
|
||||
import { ChatOpenAI } from "langchain/chat_models/openai";
|
||||
|
||||
const model = new ChatOpenAI({
|
||||
openAIApiKey: "not-needed",
|
||||
configuration: {
|
||||
baseURL: "http://localhost:8080/v1",
|
||||
},
|
||||
modelName: "gpt-4",
|
||||
});
|
||||
|
||||
const response = await model.invoke("Hello, how are you?");
|
||||
console.log(response.content);
|
||||
```
|
||||
|
||||
## Integration with Specific Tools
|
||||
|
||||
### AutoGPT
|
||||
|
||||
AutoGPT can use LocalAI by setting the API base URL:
|
||||
|
||||
```bash
|
||||
export OPENAI_API_BASE=http://localhost:8080/v1
|
||||
export OPENAI_API_KEY=not-needed
|
||||
```
|
||||
|
||||
Then run AutoGPT normally.
|
||||
|
||||
### Flowise
|
||||
|
||||
Flowise supports LocalAI out of the box. In the Flowise UI:
|
||||
|
||||
1. Add a ChatOpenAI node
|
||||
2. Set the base URL to `http://localhost:8080/v1`
|
||||
3. Set API key to any value (or leave empty)
|
||||
4. Select your model
|
||||
|
||||
### Continue (VS Code Extension)
|
||||
|
||||
Configure Continue to use LocalAI:
|
||||
|
||||
```json
|
||||
{
|
||||
"models": [
|
||||
{
|
||||
"title": "LocalAI",
|
||||
"provider": "openai",
|
||||
"model": "gpt-4",
|
||||
"apiBase": "http://localhost:8080/v1",
|
||||
"apiKey": "not-needed"
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
### AnythingLLM
|
||||
|
||||
AnythingLLM has native LocalAI support:
|
||||
|
||||
1. Go to Settings > LLM Preference
|
||||
2. Select "LocalAI"
|
||||
3. Enter your LocalAI endpoint: `http://localhost:8080`
|
||||
4. Select your model
|
||||
|
||||
## REST API Examples
|
||||
|
||||
### cURL
|
||||
|
||||
```bash
|
||||
# Chat completion
|
||||
curl http://localhost:8080/v1/chat/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "gpt-4",
|
||||
"messages": [{"role": "user", "content": "Hello!"}]
|
||||
}'
|
||||
|
||||
# List models
|
||||
curl http://localhost:8080/v1/models
|
||||
|
||||
# Embeddings
|
||||
curl http://localhost:8080/v1/embeddings \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "text-embedding-ada-002",
|
||||
"input": "Hello world"
|
||||
}'
|
||||
```
|
||||
|
||||
### Python Requests
|
||||
|
||||
```python
|
||||
import requests
|
||||
|
||||
response = requests.post(
|
||||
"http://localhost:8080/v1/chat/completions",
|
||||
json={
|
||||
"model": "gpt-4",
|
||||
"messages": [{"role": "user", "content": "Hello!"}]
|
||||
}
|
||||
)
|
||||
|
||||
print(response.json())
|
||||
```
|
||||
|
||||
## Advanced Integrations
|
||||
|
||||
### Custom Wrapper
|
||||
|
||||
Create a custom wrapper for your application:
|
||||
|
||||
```python
|
||||
class LocalAIClient:
|
||||
def __init__(self, base_url="http://localhost:8080/v1"):
|
||||
self.base_url = base_url
|
||||
self.api_key = "not-needed"
|
||||
|
||||
def chat(self, messages, model="gpt-4", **kwargs):
|
||||
response = requests.post(
|
||||
f"{self.base_url}/chat/completions",
|
||||
json={
|
||||
"model": model,
|
||||
"messages": messages,
|
||||
**kwargs
|
||||
},
|
||||
headers={"Authorization": f"Bearer {self.api_key}"}
|
||||
)
|
||||
return response.json()
|
||||
|
||||
def embeddings(self, text, model="text-embedding-ada-002"):
|
||||
response = requests.post(
|
||||
f"{self.base_url}/embeddings",
|
||||
json={
|
||||
"model": model,
|
||||
"input": text
|
||||
}
|
||||
)
|
||||
return response.json()
|
||||
```
|
||||
|
||||
### Streaming Responses
|
||||
|
||||
```python
import requests
import json

def stream_chat(messages, model="gpt-4"):
    response = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": model,
            "messages": messages,
            "stream": True
        },
        stream=True
    )

    for line in response.iter_lines():
        if not line:
            continue
        payload = line.decode('utf-8')
        # Server-sent events are prefixed with "data: " and end with a [DONE] marker
        if payload.startswith('data: '):
            payload = payload[len('data: '):]
        if payload.strip() == '[DONE]':
            break
        data = json.loads(payload)
        if 'choices' in data:
            content = data['choices'][0].get('delta', {}).get('content', '')
            if content:
                yield content
```
|
||||
|
||||
## Common Integration Patterns
|
||||
|
||||
### Error Handling
|
||||
|
||||
```python
import time

import requests
from requests.exceptions import RequestException

def safe_chat_request(messages, model="gpt-4", retries=3):
    for attempt in range(retries):
        try:
            response = requests.post(
                "http://localhost:8080/v1/chat/completions",
                json={"model": model, "messages": messages},
                timeout=30
            )
            response.raise_for_status()
            return response.json()
        except RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff
```
|
||||
|
||||
### Rate Limiting
|
||||
|
||||
```python
|
||||
from functools import wraps
|
||||
import time
|
||||
|
||||
def rate_limit(calls_per_second=2):
|
||||
min_interval = 1.0 / calls_per_second
|
||||
last_called = [0.0]
|
||||
|
||||
def decorator(func):
|
||||
@wraps(func)
|
||||
def wrapper(*args, **kwargs):
|
||||
elapsed = time.time() - last_called[0]
|
||||
left_to_wait = min_interval - elapsed
|
||||
if left_to_wait > 0:
|
||||
time.sleep(left_to_wait)
|
||||
ret = func(*args, **kwargs)
|
||||
last_called[0] = time.time()
|
||||
return ret
|
||||
return wrapper
|
||||
return decorator
|
||||
|
||||
@rate_limit(calls_per_second=2)
|
||||
def chat_request(messages):
|
||||
# Your chat request here
|
||||
pass
|
||||
```
|
||||
|
||||
## Testing Integrations
|
||||
|
||||
### Unit Tests
|
||||
|
||||
```python
import unittest
from unittest.mock import patch, Mock
import requests

class TestLocalAIIntegration(unittest.TestCase):
    @patch('requests.post')
    def test_chat_completion(self, mock_post):
        mock_response = Mock()
        mock_response.json.return_value = {
            "choices": [{
                "message": {"content": "Hello!"}
            }]
        }
        mock_post.return_value = mock_response

        # Exercise the integration code (here: a direct API call)
        response = requests.post(
            "http://localhost:8080/v1/chat/completions",
            json={"model": "gpt-4", "messages": [{"role": "user", "content": "Hi"}]},
        )
        content = response.json()["choices"][0]["message"]["content"]

        # Assert the mocked response is handled as expected
        self.assertEqual(content, "Hello!")
        mock_post.assert_called_once()

if __name__ == "__main__":
    unittest.main()
```
|
||||
|
||||
## What's Next?
|
||||
|
||||
- [API Reference]({{% relref "docs/reference/api-reference" %}}) - Complete API documentation
|
||||
- [Integrations]({{% relref "docs/integrations" %}}) - List of compatible projects
|
||||
- [Examples Repository](https://github.com/mudler/LocalAI-examples) - More integration examples
|
||||
|
||||
## See Also
|
||||
|
||||
- [Features Documentation]({{% relref "docs/features" %}}) - All LocalAI capabilities
|
||||
- [FAQ]({{% relref "docs/faq" %}}) - Common integration questions
|
||||
- [Troubleshooting]({{% relref "docs/troubleshooting" %}}) - Integration issues
|
||||
|
||||
267
docs/content/docs/tutorials/setting-up-models.md
Normal file
@@ -0,0 +1,267 @@
|
||||
+++
|
||||
disableToc = false
|
||||
title = "Setting Up Models"
|
||||
weight = 2
|
||||
icon = "hub"
|
||||
description = "Learn how to install, configure, and manage models in LocalAI"
|
||||
+++
|
||||
|
||||
This tutorial covers everything you need to know about installing and configuring models in LocalAI. You'll learn multiple methods to get models running.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- LocalAI installed and running (see [Your First Chat]({{% relref "docs/tutorials/first-chat" %}}) if you haven't set it up yet)
|
||||
- Basic understanding of command line usage
|
||||
|
||||
## Method 1: Using the Model Gallery (Easiest)
|
||||
|
||||
The Model Gallery is the simplest way to install models. It provides pre-configured models ready to use.
|
||||
|
||||
### Via WebUI
|
||||
|
||||
1. Open the LocalAI WebUI at `http://localhost:8080`
|
||||
2. Navigate to the "Models" tab
|
||||
3. Browse available models
|
||||
4. Click "Install" on any model you want
|
||||
5. Wait for installation to complete
|
||||
|
||||
## Method 1.5: Import Models via WebUI
|
||||
|
||||
The WebUI provides a powerful model import interface that supports both simple and advanced configuration:
|
||||
|
||||
### Simple Import Mode
|
||||
|
||||
1. Open the LocalAI WebUI at `http://localhost:8080`
|
||||
2. Click "Import Model"
|
||||
3. Enter the model URI (e.g., `https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct-GGUF`)
|
||||
4. Optionally configure preferences:
|
||||
- Backend selection
|
||||
- Model name
|
||||
- Description
|
||||
- Quantizations
|
||||
- Embeddings support
|
||||
- Custom preferences
|
||||
5. Click "Import Model" to start the import process
|
||||
|
||||
### Advanced Import Mode
|
||||
|
||||
For full control over model configuration:
|
||||
|
||||
1. In the WebUI, click "Import Model"
|
||||
2. Toggle to "Advanced Mode"
|
||||
3. Edit the YAML configuration directly in the code editor
|
||||
4. Use the "Validate" button to check your configuration
|
||||
5. Click "Create" or "Update" to save
|
||||
|
||||
The advanced editor includes:
|
||||
- Syntax highlighting
|
||||
- YAML validation
|
||||
- Format and copy tools
|
||||
- Full configuration options
|
||||
|
||||
This is especially useful for:
|
||||
- Custom model configurations
|
||||
- Fine-tuning model parameters
|
||||
- Setting up complex model setups
|
||||
- Editing existing model configurations
|
||||
|
||||
### Via CLI
|
||||
|
||||
```bash
|
||||
# List available models
|
||||
local-ai models list
|
||||
|
||||
# Install a specific model
|
||||
local-ai models install llama-3.2-1b-instruct:q4_k_m
|
||||
|
||||
# Start LocalAI with a model from the gallery
|
||||
local-ai run llama-3.2-1b-instruct:q4_k_m
|
||||
```
|
||||
|
||||
### Browse Online
|
||||
|
||||
Visit [models.localai.io](https://models.localai.io) to browse all available models in your browser.
|
||||
|
||||
## Method 2: Installing from Hugging Face
|
||||
|
||||
LocalAI can directly install models from Hugging Face:
|
||||
|
||||
```bash
|
||||
# Install and run a model from Hugging Face
|
||||
local-ai run huggingface://TheBloke/phi-2-GGUF/phi-2.Q8_0.gguf
|
||||
```
|
||||
|
||||
The format is: `huggingface://<repository>/<model-file>`
|
||||
|
||||
## Method 3: Installing from OCI Registries
|
||||
|
||||
### Ollama Registry
|
||||
|
||||
```bash
|
||||
local-ai run ollama://gemma:2b
|
||||
```
|
||||
|
||||
### Standard OCI Registry
|
||||
|
||||
```bash
|
||||
local-ai run oci://localai/phi-2:latest
|
||||
```
|
||||
|
||||
## Method 4: Manual Installation
|
||||
|
||||
For full control, you can manually download and configure models.
|
||||
|
||||
### Step 1: Download a Model
|
||||
|
||||
Download a GGUF model file. Popular sources:
|
||||
- [Hugging Face](https://huggingface.co/models?search=gguf)
|
||||
|
||||
Example:
|
||||
```bash
|
||||
mkdir -p models
|
||||
wget https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf \
|
||||
-O models/phi-2.Q4_K_M.gguf
|
||||
```
|
||||
|
||||
### Step 2: Create a Configuration File (Optional)
|
||||
|
||||
Create a YAML file to configure the model:
|
||||
|
||||
```yaml
|
||||
# models/phi-2.yaml
|
||||
name: phi-2
|
||||
parameters:
|
||||
model: phi-2.Q4_K_M.gguf
|
||||
temperature: 0.7
|
||||
context_size: 2048
|
||||
threads: 4
|
||||
backend: llama-cpp
|
||||
```
|
||||
|
||||
### Step 3: Start LocalAI
|
||||
|
||||
```bash
|
||||
# With Docker
|
||||
docker run -p 8080:8080 -v $PWD/models:/models \
|
||||
localai/localai:latest
|
||||
|
||||
# Or with binary
|
||||
local-ai --models-path ./models
|
||||
```
|
||||
|
||||
## Understanding Model Files
|
||||
|
||||
### File Formats
|
||||
|
||||
- **GGUF**: Modern format, recommended for most use cases
|
||||
- **GGML**: Older format, still supported but deprecated
|
||||
|
||||
### Quantization Levels
|
||||
|
||||
Models come in different quantization levels (quality vs. size trade-off):
|
||||
|
||||
| Quantization | Size | Quality | Use Case |
|
||||
|-------------|------|---------|----------|
|
||||
| Q8_0 | Largest | Highest | Best quality, requires more RAM |
|
||||
| Q6_K | Large | Very High | High quality |
|
||||
| Q4_K_M | Medium | High | Balanced (recommended) |
|
||||
| Q4_K_S | Small | Medium | Lower RAM usage |
|
||||
| Q2_K | Smallest | Lower | Minimal RAM, lower quality |
|
||||
|
||||
### Choosing the Right Model
|
||||
|
||||
Consider:
|
||||
- **RAM available**: Larger models need more RAM (see the quick check after this list)
|
||||
- **Use case**: Different models excel at different tasks
|
||||
- **Speed**: Smaller quantizations are faster
|
||||
- **Quality**: Higher quantizations produce better output
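As a rough first check against the RAM consideration above, a GGUF model generally needs at least its file size in memory plus headroom for the context window, so comparing file sizes with free memory gives a useful approximation:

```bash
# Compare model file sizes against available memory
ls -lh models/*.gguf
free -h
```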
|
||||
|
||||
## Model Configuration
|
||||
|
||||
### Basic Configuration
|
||||
|
||||
Create a YAML file in your models directory:
|
||||
|
||||
```yaml
|
||||
name: my-model
|
||||
parameters:
|
||||
model: model.gguf
|
||||
temperature: 0.7
|
||||
top_p: 0.9
|
||||
context_size: 2048
|
||||
threads: 4
|
||||
backend: llama-cpp
|
||||
```
|
||||
|
||||
### Advanced Configuration
|
||||
|
||||
See the [Model Configuration]({{% relref "docs/advanced/model-configuration" %}}) guide for all available options.
|
||||
|
||||
## Managing Models
|
||||
|
||||
### List Installed Models
|
||||
|
||||
```bash
|
||||
# Via API
|
||||
curl http://localhost:8080/v1/models
|
||||
|
||||
# Via CLI
|
||||
local-ai models list
|
||||
```
|
||||
|
||||
### Remove Models
|
||||
|
||||
Simply delete the model file and configuration from your models directory:
|
||||
|
||||
```bash
|
||||
rm models/model-name.gguf
|
||||
rm models/model-name.yaml # if exists
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Model Not Loading
|
||||
|
||||
1. **Check backend**: Ensure the required backend is installed
|
||||
```bash
|
||||
local-ai backends list
|
||||
local-ai backends install llama-cpp # if needed
|
||||
```
|
||||
|
||||
2. **Check logs**: Enable debug mode
|
||||
```bash
|
||||
DEBUG=true local-ai
|
||||
```
|
||||
|
||||
3. **Verify file**: Ensure the model file is not corrupted
|
||||
|
||||
### Out of Memory
|
||||
|
||||
- Use a smaller quantization (Q4_K_S or Q2_K)
|
||||
- Reduce `context_size` in configuration
|
||||
- Close other applications to free RAM
|
||||
|
||||
### Wrong Backend
|
||||
|
||||
Check the [Compatibility Table]({{% relref "docs/reference/compatibility-table" %}}) to ensure you're using the correct backend for your model.
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Start small**: Begin with smaller models to test your setup
|
||||
2. **Use quantized models**: Q4_K_M is a good balance for most use cases
|
||||
3. **Organize models**: Keep your models directory organized
|
||||
4. **Backup configurations**: Save your YAML configurations
|
||||
5. **Monitor resources**: Watch RAM and disk usage
|
||||
|
||||
## What's Next?
|
||||
|
||||
- [Using GPU Acceleration]({{% relref "docs/tutorials/using-gpu" %}}) - Speed up inference
|
||||
- [Model Configuration]({{% relref "docs/advanced/model-configuration" %}}) - Advanced configuration options
|
||||
- [Compatibility Table]({{% relref "docs/reference/compatibility-table" %}}) - Find compatible models and backends
|
||||
|
||||
## See Also
|
||||
|
||||
- [Model Gallery Documentation]({{% relref "docs/features/model-gallery" %}})
|
||||
- [Install and Run Models]({{% relref "docs/getting-started/models" %}})
|
||||
- [FAQ]({{% relref "docs/faq" %}})
|
||||
|
||||
254
docs/content/docs/tutorials/using-gpu.md
Normal file
@@ -0,0 +1,254 @@
|
||||
+++
|
||||
disableToc = false
|
||||
title = "Using GPU Acceleration"
|
||||
weight = 3
|
||||
icon = "memory"
|
||||
description = "Set up GPU acceleration for faster inference"
|
||||
+++
|
||||
|
||||
This tutorial will guide you through setting up GPU acceleration for LocalAI. GPU acceleration can significantly speed up model inference, especially for larger models.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- A compatible GPU (NVIDIA, AMD, Intel, or Apple Silicon)
|
||||
- LocalAI installed
|
||||
- Basic understanding of your system's GPU setup
|
||||
|
||||
## Check Your GPU
|
||||
|
||||
First, verify you have a compatible GPU:
|
||||
|
||||
### NVIDIA
|
||||
|
||||
```bash
|
||||
nvidia-smi
|
||||
```
|
||||
|
||||
You should see your GPU information. Ensure you have CUDA 11.7 or 12.0+ installed.
|
||||
|
||||
### AMD
|
||||
|
||||
```bash
|
||||
rocminfo
|
||||
```
|
||||
|
||||
### Intel
|
||||
|
||||
```bash
|
||||
intel_gpu_top # if available
|
||||
```
|
||||
|
||||
### Apple Silicon (macOS)
|
||||
|
||||
Apple Silicon (M1/M2/M3) GPUs are automatically detected. No additional setup needed!
|
||||
|
||||
## Installation Methods
|
||||
|
||||
### Method 1: Docker with GPU Support (Recommended)
|
||||
|
||||
#### NVIDIA CUDA
|
||||
|
||||
```bash
|
||||
# CUDA 12.0
|
||||
docker run -p 8080:8080 --gpus all --name local-ai \
|
||||
-ti localai/localai:latest-gpu-nvidia-cuda-12
|
||||
|
||||
# CUDA 11.7
|
||||
docker run -p 8080:8080 --gpus all --name local-ai \
|
||||
-ti localai/localai:latest-gpu-nvidia-cuda-11
|
||||
```
|
||||
|
||||
**Prerequisites**: Install [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
|
||||
|
||||
#### AMD ROCm
|
||||
|
||||
```bash
|
||||
docker run -p 8080:8080 \
|
||||
--device=/dev/kfd \
|
||||
--device=/dev/dri \
|
||||
--group-add=video \
|
||||
--name local-ai \
|
||||
-ti localai/localai:latest-gpu-hipblas
|
||||
```
|
||||
|
||||
#### Intel GPU
|
||||
|
||||
```bash
|
||||
docker run -p 8080:8080 --name local-ai \
|
||||
-ti localai/localai:latest-gpu-intel
|
||||
```
|
||||
|
||||
#### Apple Silicon
|
||||
|
||||
GPU acceleration works automatically when running on macOS with Apple Silicon. Use the standard CPU image - Metal acceleration is built-in.
|
||||
|
||||
### Method 2: AIO Images with GPU
|
||||
|
||||
AIO images are also available with GPU support:
|
||||
|
||||
```bash
|
||||
# NVIDIA CUDA 12
|
||||
docker run -p 8080:8080 --gpus all --name local-ai \
|
||||
-ti localai/localai:latest-aio-gpu-nvidia-cuda-12
|
||||
|
||||
# AMD
|
||||
docker run -p 8080:8080 \
|
||||
--device=/dev/kfd --device=/dev/dri --group-add=video \
|
||||
--name local-ai \
|
||||
-ti localai/localai:latest-aio-gpu-hipblas
|
||||
```
|
||||
|
||||
### Method 3: Build from Source
|
||||
|
||||
For building with GPU support from source, see the [Build Guide]({{% relref "docs/getting-started/build" %}}).
|
||||
|
||||
## Configuring Models for GPU
|
||||
|
||||
### Automatic Detection
|
||||
|
||||
LocalAI automatically detects GPU capabilities and downloads the appropriate backend when you install models from the gallery.
|
||||
|
||||
### Manual Configuration
|
||||
|
||||
In your model YAML configuration, specify GPU layers:
|
||||
|
||||
```yaml
|
||||
name: my-model
|
||||
parameters:
|
||||
model: model.gguf
|
||||
backend: llama-cpp
|
||||
# Offload layers to GPU (adjust based on your GPU memory)
|
||||
f16: true
|
||||
gpu_layers: 35 # Number of layers to offload to GPU
|
||||
```
|
||||
|
||||
**GPU Layers Guidelines**:
|
||||
- **Small GPU (4-6GB)**: 20-30 layers
|
||||
- **Medium GPU (8-12GB)**: 30-40 layers
|
||||
- **Large GPU (16GB+)**: 40+ layers or set to model's total layer count
|
||||
|
||||
### Finding the Right Number of Layers
|
||||
|
||||
1. Start with a conservative number (e.g., 20)
|
||||
2. Monitor GPU memory usage: `nvidia-smi` (NVIDIA) or `rocminfo` (AMD)
|
||||
3. Gradually increase until you reach GPU memory limits
|
||||
4. For maximum performance, offload all layers if you have enough VRAM
|
||||
|
||||
## Verifying GPU Usage
|
||||
|
||||
### Check if GPU is Being Used
|
||||
|
||||
#### NVIDIA
|
||||
|
||||
```bash
|
||||
# Watch GPU usage in real-time
|
||||
watch -n 1 nvidia-smi
|
||||
```
|
||||
|
||||
You should see:
|
||||
- GPU utilization > 0%
|
||||
- Memory usage increasing
|
||||
- Processes running on GPU
|
||||
|
||||
#### AMD
|
||||
|
||||
```bash
|
||||
rocm-smi
|
||||
```
|
||||
|
||||
#### Check Logs
|
||||
|
||||
Enable debug mode to see GPU information in logs:
|
||||
|
||||
```bash
|
||||
DEBUG=true local-ai
|
||||
```
|
||||
|
||||
Look for messages indicating GPU initialization and layer offloading.
|
||||
|
||||
## Performance Tips
|
||||
|
||||
### 1. Optimize GPU Layers
|
||||
|
||||
- Offload as many layers as your GPU memory allows
|
||||
- Balance between GPU and CPU layers for best performance
|
||||
- Use `f16: true` for better GPU performance
|
||||
|
||||
### 2. Batch Processing
|
||||
|
||||
GPU excels at batch processing. Process multiple requests together when possible.
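How much true parallelism you get depends on the backend and on how LocalAI is configured, but issuing concurrent requests from a client is straightforward; this sketch simply fires two chat requests in parallel from the shell:

```bash
# Send two requests concurrently and wait for both to finish
for prompt in "Summarize GPUs in one line" "Summarize CPUs in one line"; do
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"gpt-4\", \"messages\": [{\"role\": \"user\", \"content\": \"$prompt\"}]}" &
done
wait
```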
|
||||
|
||||
### 3. Model Quantization
|
||||
|
||||
Even with GPU, quantized models (Q4_K_M) often provide the best speed/quality balance.
|
||||
|
||||
### 4. Context Size
|
||||
|
||||
Larger context sizes use more GPU memory. Adjust based on your GPU:
|
||||
|
||||
```yaml
|
||||
context_size: 4096 # Adjust based on GPU memory
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### GPU Not Detected
|
||||
|
||||
1. **Check drivers**: Ensure GPU drivers are installed
|
||||
2. **Check Docker**: Verify Docker has GPU access
|
||||
```bash
|
||||
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu22.04 nvidia-smi
|
||||
```
|
||||
3. **Check logs**: Enable debug mode and check for GPU-related errors
|
||||
|
||||
### Out of GPU Memory
|
||||
|
||||
- Reduce `gpu_layers` in model configuration
|
||||
- Use a smaller model or lower quantization
|
||||
- Reduce `context_size`
|
||||
- Close other GPU-using applications
|
||||
|
||||
### Slow Performance
|
||||
|
||||
- Ensure you're using the correct GPU image
|
||||
- Check that layers are actually offloaded (check logs)
|
||||
- Verify GPU drivers are up to date
|
||||
- Consider using a more powerful GPU or reducing model size
|
||||
|
||||
### CUDA Errors
|
||||
|
||||
- Ensure CUDA version matches (11.7 vs 12.0)
|
||||
- Check CUDA compatibility with your GPU
|
||||
- Try rebuilding with `REBUILD=true`
|
||||
|
||||
## Platform-Specific Notes
|
||||
|
||||
### NVIDIA Jetson (L4T)
|
||||
|
||||
Use the L4T-specific images:
|
||||
|
||||
```bash
|
||||
docker run -p 8080:8080 --runtime nvidia --gpus all \
|
||||
--name local-ai \
|
||||
-ti localai/localai:latest-nvidia-l4t-arm64
|
||||
```
|
||||
|
||||
### Apple Silicon
|
||||
|
||||
- Metal acceleration is automatic
|
||||
- No special Docker flags needed
|
||||
- Use standard CPU images - Metal is built-in
|
||||
- For best performance, build from source on macOS
|
||||
|
||||
## What's Next?
|
||||
|
||||
- [GPU Acceleration Documentation]({{% relref "docs/features/gpu-acceleration" %}}) - Detailed GPU information
|
||||
- [Performance Tuning]({{% relref "docs/advanced/performance-tuning" %}}) - Optimize your setup
|
||||
- [VRAM Management]({{% relref "docs/advanced/vram-management" %}}) - Manage GPU memory efficiently
|
||||
|
||||
## See Also
|
||||
|
||||
- [Compatibility Table]({{% relref "docs/reference/compatibility-table" %}}) - GPU support by backend
|
||||
- [Build Guide]({{% relref "docs/getting-started/build" %}}) - Build with GPU support
|
||||
- [FAQ]({{% relref "docs/faq" %}}) - Common GPU questions
|
||||
|
||||
@@ -1,16 +1,60 @@
|
||||
+++
|
||||
disableToc = false
|
||||
title = "News"
|
||||
title = "What's New"
|
||||
weight = 7
|
||||
url = '/basics/news/'
|
||||
icon = "newspaper"
|
||||
+++
|
||||
|
||||
Release notes have been now moved completely over Github releases.
|
||||
Release notes have been moved to GitHub releases for the most up-to-date information.
|
||||
|
||||
You can see the release notes [here](https://github.com/mudler/LocalAI/releases).
|
||||
You can see all release notes [here](https://github.com/mudler/LocalAI/releases).
|
||||
|
||||
# Older release notes
|
||||
## Recent Highlights
|
||||
|
||||
### 2025
|
||||
|
||||
**July 2025**: All backends were migrated out of the main binary. LocalAI is now smaller and more lightweight, and it automatically downloads the required backend for the model being run. [Read the release notes](https://github.com/mudler/LocalAI/releases/tag/v3.2.0)

**June 2025**: [Backend management](https://github.com/mudler/LocalAI/pull/5607) has been added. Attention: extras images will be deprecated in the next release! Read [the backend management PR](https://github.com/mudler/LocalAI/pull/5607).
|
||||
|
||||
**May 2025**: [Audio input](https://github.com/mudler/LocalAI/pull/5466) and [Reranking](https://github.com/mudler/LocalAI/pull/5396) in llama.cpp backend, [Realtime API](https://github.com/mudler/LocalAI/pull/5392), Support to Gemma, SmollVLM, and more multimodal models (available in the gallery).
|
||||
|
||||
**May 2025**: Important: image name changes [See release](https://github.com/mudler/LocalAI/releases/tag/v2.29.0)
|
||||
|
||||
**April 2025**: Rebrand, WebUI enhancements
|
||||
|
||||
**April 2025**: [LocalAGI](https://github.com/mudler/LocalAGI) and [LocalRecall](https://github.com/mudler/LocalRecall) join the LocalAI family stack.
|
||||
|
||||
**April 2025**: WebUI overhaul, AIO images updates
|
||||
|
||||
**February 2025**: Backend cleanup, Breaking changes, new backends (kokoro, OutelTTS, faster-whisper), Nvidia L4T images
|
||||
|
||||
**January 2025**: LocalAI model release: https://huggingface.co/mudler/LocalAI-functioncall-phi-4-v0.3, SANA support in diffusers: https://github.com/mudler/LocalAI/pull/4603
|
||||
|
||||
### 2024
|
||||
|
||||
**December 2024**: stablediffusion.cpp backend (ggml) added ( https://github.com/mudler/LocalAI/pull/4289 )
|
||||
|
||||
**November 2024**: Bark.cpp backend added ( https://github.com/mudler/LocalAI/pull/4287 )
|
||||
|
||||
**November 2024**: Voice activity detection models (**VAD**) added to the API: https://github.com/mudler/LocalAI/pull/4204
|
||||
|
||||
**October 2024**: Examples moved to [LocalAI-examples](https://github.com/mudler/LocalAI-examples)
|
||||
|
||||
**August 2024**: 🆕 FLUX-1, [P2P Explorer](https://explorer.localai.io)
|
||||
|
||||
**July 2024**: 🔥🔥 🆕 P2P Dashboard, LocalAI Federated mode and AI Swarms: https://github.com/mudler/LocalAI/pull/2723. P2P Global community pools: https://github.com/mudler/LocalAI/issues/3113
|
||||
|
||||
**May 2024**: 🔥🔥 Decentralized P2P llama.cpp: https://github.com/mudler/LocalAI/pull/2343 (peer2peer llama.cpp!) 👉 Docs https://localai.io/features/distribute/
|
||||
|
||||
**May 2024**: 🔥🔥 Distributed inferencing: https://github.com/mudler/LocalAI/pull/2324
|
||||
|
||||
**April 2024**: Reranker API: https://github.com/mudler/LocalAI/pull/2121
|
||||
|
||||
---
|
||||
|
||||
## Archive: Older Release Notes (2023 and earlier)
|
||||
|
||||
## 04-12-2023: __v2.0.0__
|
||||
|
||||
@@ -58,7 +102,7 @@ Thanks to @jespino now the local-ai binary has more subcommands allowing to mana
|
||||
|
||||
This is an exciting LocalAI release! Besides bug-fixes and enhancements this release brings the new backend to a whole new level by extending support to vllm and vall-e-x for audio generation!
|
||||
|
||||
Check out the documentation for vllm [here](https://localai.io/model-compatibility/vllm/) and Vall-E-X [here](https://localai.io/model-compatibility/vall-e-x/)
|
||||
Check out the documentation for vllm [here]({{% relref "docs/reference/compatibility-table" %}}) and Vall-E-X [here]({{% relref "docs/reference/compatibility-table" %}})
|
||||
|
||||
[Release notes](https://github.com/mudler/LocalAI/releases/tag/v1.30.0)