mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-03 22:07:58 -04:00
+++
disableToc = false
title = "Model Quantization"
weight = 19
url = '/features/quantization/'
+++

LocalAI supports model quantization directly through the API and Web UI. Quantization converts HuggingFace models to GGUF format and compresses them to smaller sizes for efficient inference with llama.cpp.

{{% notice note %}}
This feature is **experimental** and may change in future releases.
{{% /notice %}}

## Supported Backends

| Backend | Description | Quantization Types | Platforms |
|---------|-------------|--------------------|-----------|
| **llama-cpp-quantization** | Downloads HF models, converts to GGUF, and quantizes using llama.cpp | q2_k, q3_k_s, q3_k_m, q3_k_l, q4_0, q4_k_s, q4_k_m, q5_0, q5_k_s, q5_k_m, q6_k, q8_0, f16 | CPU (Linux, macOS) |

## Quick Start

### 1. Start a quantization job

```bash
curl -X POST http://localhost:8080/api/quantization/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/functiongemma-270m-it",
    "quantization_type": "q4_k_m"
  }'
```

### 2. Monitor progress (SSE stream)

```bash
# Set JOB_ID to the ID of the job started in step 1.
curl -N "http://localhost:8080/api/quantization/jobs/$JOB_ID/progress"
```

### 3. Download the quantized model

```bash
curl -o model.gguf "http://localhost:8080/api/quantization/jobs/$JOB_ID/download"
```

### 4. Or import it directly into LocalAI

```bash
curl -X POST "http://localhost:8080/api/quantization/jobs/$JOB_ID/import" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-quantized-model"
  }'
```

## API Reference

### Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/api/quantization/jobs` | Start a quantization job |
| `GET` | `/api/quantization/jobs` | List all jobs |
| `GET` | `/api/quantization/jobs/:id` | Get job details |
| `POST` | `/api/quantization/jobs/:id/stop` | Stop a running job |
| `DELETE` | `/api/quantization/jobs/:id` | Delete a job and its data |
| `GET` | `/api/quantization/jobs/:id/progress` | SSE progress stream |
| `POST` | `/api/quantization/jobs/:id/import` | Import quantized model into LocalAI |
| `GET` | `/api/quantization/jobs/:id/download` | Download quantized GGUF file |
| `GET` | `/api/quantization/backends` | List available quantization backends |

### Job Request Fields

| Field | Type | Description |
|-------|------|-------------|
| `model` | string | HuggingFace model ID or local path (required) |
| `backend` | string | Backend name (default: `llama-cpp-quantization`) |
| `quantization_type` | string | Quantization format (default: `q4_k_m`) |
| `extra_options` | map | Backend-specific options (see below) |

### Extra Options

| Key | Description |
|-----|-------------|
| `hf_token` | HuggingFace token for gated models |

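
The request fields above can be combined into a single JSON body. The following is a minimal sketch, not part of LocalAI: `build_job_payload` is a hypothetical helper name, and it assumes `extra_options` is sent as a JSON object with string values. It omits `backend`, so the documented default applies.

```bash
# Build the JSON body for POST /api/quantization/jobs.
# Usage: build_job_payload MODEL [QUANT_TYPE] [HF_TOKEN]
# Hypothetical helper for illustration. Values are inserted verbatim, so they
# must not contain double quotes or backslashes.
build_job_payload() {
  model="$1"
  qtype="${2:-q4_k_m}"   # mirrors the documented default
  token="${3:-}"
  if [ -n "$token" ]; then
    # Gated model: pass the token through extra_options.
    printf '{"model":"%s","quantization_type":"%s","extra_options":{"hf_token":"%s"}}\n' \
      "$model" "$qtype" "$token"
  else
    printf '{"model":"%s","quantization_type":"%s"}\n' "$model" "$qtype"
  fi
}
```

The output can be passed to the job endpoint, for example with `curl -X POST http://localhost:8080/api/quantization/jobs -H "Content-Type: application/json" -d "$(build_job_payload "$MODEL" q4_k_m "$HF_TOKEN")"`, where `MODEL` and `HF_TOKEN` are variables you set yourself.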

### Import Request Fields

| Field | Type | Description |
|-------|------|-------------|
| `name` | string | Model name for LocalAI (auto-generated if empty) |

### Job Status Values

| Status | Description |
|--------|-------------|
| `queued` | Job created, waiting to start |
| `downloading` | Downloading model from HuggingFace |
| `converting` | Converting model to f16 GGUF |
| `quantizing` | Running quantization |
| `completed` | Quantization finished successfully |
| `failed` | Job failed (check message for details) |
| `stopped` | Job stopped by user |

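
When scripting against these statuses, it helps to know when to stop waiting. The sketch below uses a hypothetical helper, `is_terminal_status`, and assumes (based only on the descriptions in the table) that `completed`, `failed`, and `stopped` are final while the other values are still in progress.

```bash
# Succeed (exit status 0) when a job status means the job will not progress
# any further. Which statuses are final is an assumption drawn from the table.
is_terminal_status() {
  case "$1" in
    completed|failed|stopped) return 0 ;;
    *) return 1 ;;
  esac
}
```

A polling or streaming loop can call `is_terminal_status "$status"` after each update and exit once it succeeds, then decide between download, import, or error handling depending on the final value.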

### Progress Stream

The `GET /api/quantization/jobs/:id/progress` endpoint returns Server-Sent Events (SSE) with JSON payloads:

```json
{
  "job_id": "abc-123",
  "progress_percent": 65.0,
  "status": "quantizing",
  "message": "[ 234/ 567] quantizing blk.5.attn_k.weight ...",
  "output_file": "",
  "extra_metrics": {}
}
```

When the job completes, `output_file` contains the path to the quantized GGUF file and `extra_metrics` includes `file_size_mb`.
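
To reduce the stream to a compact progress line, a script can pick `status` and `progress_percent` out of each event. This is a rough sketch: `print_progress` is a hypothetical helper, and it assumes each event arrives as one JSON object on a standard SSE `data:` line. The `sed` extraction is deliberately crude; a JSON-aware tool such as `jq` would be more robust.

```bash
# Read an SSE stream on stdin and print "<status> <percent>%" per event.
# Assumes standard SSE framing: one JSON payload per "data:" line.
print_progress() {
  while IFS= read -r line; do
    case "$line" in
      data:*) payload=${line#data:} ;;   # keep only the JSON payload
      *) continue ;;                     # skip blank lines and SSE comments
    esac
    status=$(printf '%s\n' "$payload" |
      sed -n 's/.*"status": *"\([^"]*\)".*/\1/p')
    percent=$(printf '%s\n' "$payload" |
      sed -n 's/.*"progress_percent": *\([0-9.]*\).*/\1/p')
    printf '%s %s%%\n' "$status" "$percent"
  done
}
```

It can be fed from the progress endpoint, for example `curl -N -s "http://localhost:8080/api/quantization/jobs/$JOB_ID/progress" | print_progress`, with `JOB_ID` set to the ID of your job.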

## Quantization Types

| Type | Size | Quality | Description |
|------|------|---------|-------------|
| `q2_k` | Smallest | Lowest | 2-bit quantization |
| `q3_k_s` | Very small | Low | 3-bit small |
| `q3_k_m` | Very small | Low | 3-bit medium |
| `q3_k_l` | Small | Low-medium | 3-bit large |
| `q4_0` | Small | Medium | 4-bit legacy |
| `q4_k_s` | Small | Medium | 4-bit small |
| `q4_k_m` | Small | **Good** | **4-bit medium (recommended)** |
| `q5_0` | Medium | Good | 5-bit legacy |
| `q5_k_s` | Medium | Good | 5-bit small |
| `q5_k_m` | Medium | Very good | 5-bit medium |
| `q6_k` | Large | Excellent | 6-bit |
| `q8_0` | Large | Near-lossless | 8-bit |
| `f16` | Largest | Lossless | 16-bit (no quantization, GGUF conversion only) |

The UI also supports entering a custom quantization type string for any format supported by `llama-quantize`.
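
The Size column in the table above is relative. For a rough absolute figure, the fixed-layout types can be estimated from bits per weight: `q4_0`, `q5_0`, and `q8_0` store blocks of 32 weights plus a 16-bit scale, which works out to 4.5, 5.5, and 8.5 bits per weight, and `f16` uses 16. The sketch below uses hypothetical helper names and covers only those four types. The k-quant types mix several layouts, so they are left out rather than guessed. Real files come out somewhat different because some tensors keep a higher precision and the file carries metadata.

```bash
# Bits per weight for the fixed-layout types; fails for any other type.
bits_per_weight() {
  case "$1" in
    f16)  echo 16 ;;
    q8_0) echo 8.5 ;;
    q5_0) echo 5.5 ;;
    q4_0) echo 4.5 ;;
    *)    return 1 ;;
  esac
}

# Usage: estimate_size_mb PARAMETER_COUNT TYPE
# Prints an approximate size in MB (10^6 bytes), weights only.
estimate_size_mb() {
  bpw=$(bits_per_weight "$2") || return 1
  awk -v n="$1" -v b="$bpw" 'BEGIN { printf "%.1f\n", n * b / 8 / 1000000 }'
}
```

For a model with 270 million parameters, `estimate_size_mb 270000000 q4_0` gives about 151.9 MB, compared with 540.0 MB for `f16`.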

## Web UI

A "Quantize" page appears in the sidebar under the Tools section. The UI provides:

1. **Job Configuration**: Select model, quantization type (dropdown with presets or custom input), backend, and HuggingFace token
2. **Progress Monitor**: Real-time progress bar and log output via SSE
3. **Jobs List**: View all quantization jobs with status, stop/delete actions
4. **Output**: Download the quantized GGUF file or import it directly into LocalAI for immediate use

## Architecture

Quantization uses the same gRPC backend architecture as fine-tuning:

1. **Proto layer**: `QuantizationRequest`, `QuantizationProgress` (streaming), `StopQuantization`
2. **Python backend**: Downloads model, runs `convert_hf_to_gguf.py` and `llama-quantize`
3. **Go service**: Manages job lifecycle, state persistence, async import
4. **REST API**: HTTP endpoints with SSE progress streaming
5. **React UI**: Configuration form, real-time progress monitor, download/import panel

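
For orientation, the two tools named in the Python backend step can also be run by hand from a llama.cpp checkout. The sketch below only prints the equivalent commands for a local HuggingFace model directory. `manual_quantize_commands` and the `model-*.gguf` file names are hypothetical, and the exact arguments the LocalAI backend passes are not documented here, so treat this as an approximation of the convert-then-quantize flow rather than the backend's implementation.

```bash
# Print (do not run) the convert and quantize commands for a model directory.
# Usage: manual_quantize_commands MODEL_DIR [QUANT_TYPE]
manual_quantize_commands() {
  model_dir="$1"
  qtype="${2:-q4_k_m}"
  f16_file="$model_dir/model-f16.gguf"      # hypothetical intermediate file
  out_file="$model_dir/model-$qtype.gguf"   # hypothetical output file
  # Step 1: convert the HF checkpoint to an f16 GGUF file.
  printf 'python convert_hf_to_gguf.py %s --outfile %s --outtype f16\n' \
    "$model_dir" "$f16_file"
  # Step 2: quantize the f16 GGUF; the type name is upper-cased.
  printf 'llama-quantize %s %s %s\n' "$f16_file" "$out_file" \
    "$(printf '%s' "$qtype" | tr '[:lower:]' '[:upper:]')"
}
```

Running `manual_quantize_commands ./my-model q4_k_m` prints one command per step, matching the `converting` and `quantizing` job statuses.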