LocalAI/docs/content/features/quantization.md at 415b56194752d2d80576f050d462beaaea993d7f

mirror of https://github.com/mudler/LocalAI.git synced 2026-06-03 22:07:58 -04:00

+++
disableToc = false
title = "Model Quantization"
weight = 19
url = '/features/quantization/'
+++

LocalAI supports model quantization directly through the API and Web UI. Quantization converts HuggingFace models to GGUF format and compresses them to smaller sizes for efficient inference with llama.cpp.

{{% notice note %}}
This feature is experimental and may change in future releases.
{{% /notice %}}

Supported Backends

| Backend | Description | Quantization Types | Platforms |
|---|---|---|---|
| `llama-cpp-quantization` | Downloads HF models, converts to GGUF, and quantizes using llama.cpp | `q2_k`, `q3_k_s`, `q3_k_m`, `q3_k_l`, `q4_0`, `q4_k_s`, `q4_k_m`, `q5_0`, `q5_k_s`, `q5_k_m`, `q6_k`, `q8_0`, `f16` | CPU (Linux, macOS) |

Quick Start

1. Start a quantization job

curl -X POST http://localhost:8080/api/quantization/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/functiongemma-270m-it",
    "quantization_type": "q4_k_m"
  }'

2. Monitor progress (SSE stream)

curl -N http://localhost:8080/api/quantization/jobs/{job_id}/progress

3. Download the quantized model

curl -o model.gguf http://localhost:8080/api/quantization/jobs/{job_id}/download

4. Or import it directly into LocalAI

curl -X POST http://localhost:8080/api/quantization/jobs/{job_id}/import \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-quantized-model"
  }'

API Reference

Endpoints

| Method | Path | Description |
|---|---|---|
| `POST` | `/api/quantization/jobs` | Start a quantization job |
| `GET` | `/api/quantization/jobs` | List all jobs |
| `GET` | `/api/quantization/jobs/:id` | Get job details |
| `POST` | `/api/quantization/jobs/:id/stop` | Stop a running job |
| `DELETE` | `/api/quantization/jobs/:id` | Delete a job and its data |
| `GET` | `/api/quantization/jobs/:id/progress` | SSE progress stream |
| `POST` | `/api/quantization/jobs/:id/import` | Import quantized model into LocalAI |
| `GET` | `/api/quantization/jobs/:id/download` | Download quantized GGUF file |
| `GET` | `/api/quantization/backends` | List available quantization backends |
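
The per-job routes in the table all share one path prefix, so a client can derive them from a small lookup. The following is a minimal sketch, not part of LocalAI: the `JOB_ACTIONS` table and the `job_endpoint` helper are hypothetical names, and only the methods and paths themselves come from the table above.

```python
from urllib.parse import quote

# Methods and paths copied from the endpoint table; "{id}" marks the job ID.
# The action names used as keys are illustrative, not part of the API.
JOB_ACTIONS = {
    "details": ("GET", "/api/quantization/jobs/{id}"),
    "stop": ("POST", "/api/quantization/jobs/{id}/stop"),
    "delete": ("DELETE", "/api/quantization/jobs/{id}"),
    "progress": ("GET", "/api/quantization/jobs/{id}/progress"),
    "import": ("POST", "/api/quantization/jobs/{id}/import"),
    "download": ("GET", "/api/quantization/jobs/{id}/download"),
}


def job_endpoint(base_url, job_id, action):
    """Return (method, url) for one per-job action."""
    method, template = JOB_ACTIONS[action]
    # Percent-encode the ID so an unexpected character cannot alter the path.
    path = template.replace("{id}", quote(job_id, safe=""))
    return method, base_url.rstrip("/") + path
```

For example, `job_endpoint("http://localhost:8080", "abc-123", "download")` yields the `GET` method together with the download URL used in the Quick Start.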

Job Request Fields

| Field | Type | Description |
|---|---|---|
| `model` | string | HuggingFace model ID or local path (required) |
| `backend` | string | Backend name (default: `llama-cpp-quantization`) |
| `quantization_type` | string | Quantization format (default: `q4_k_m`) |
| `extra_options` | map | Backend-specific options (see below) |

Extra Options

| Key | Description |
|---|---|
| `hf_token` | HuggingFace token for gated models |
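
As an illustration of how the request fields and `extra_options` fit together, the sketch below assembles a job request body in Python. The `build_job_payload` helper is hypothetical. It leaves optional fields out when they are not set, on the assumption that the server then applies the defaults listed in the table.

```python
import json


def build_job_payload(model, quantization_type=None, backend=None, hf_token=None):
    """Build the JSON body for POST /api/quantization/jobs."""
    if not model:
        # `model` is the only field the table marks as required.
        raise ValueError("model is required")
    payload = {"model": model}
    if backend:
        payload["backend"] = backend
    if quantization_type:
        payload["quantization_type"] = quantization_type
    if hf_token:
        # Backend-specific options travel in the extra_options map.
        payload["extra_options"] = {"hf_token": hf_token}
    return payload


def to_json(payload):
    """Serialize the payload for use as the HTTP request body."""
    return json.dumps(payload)
```

Calling `build_job_payload("unsloth/functiongemma-270m-it", quantization_type="q4_k_m")` produces the same body as the Quick Start request.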

Import Request Fields

| Field | Type | Description |
|---|---|---|
| `name` | string | Model name for LocalAI (auto-generated if empty) |

Job Status Values

| Status | Description |
|---|---|
| `queued` | Job created, waiting to start |
| `downloading` | Downloading model from HuggingFace |
| `converting` | Converting model to f16 GGUF |
| `quantizing` | Running quantization |
| `completed` | Quantization finished successfully |
| `failed` | Job failed (check `message` for details) |
| `stopped` | Job stopped by user |
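
A client that polls or streams a job usually needs to know when to stop waiting. The sketch below groups the statuses from the table. It assumes that `completed`, `failed`, and `stopped` are final and that the other four occur in the order listed, which the table suggests but does not state outright. The helper names are hypothetical.

```python
# Statuses taken from the table; the grouping below is an assumption.
ACTIVE_STATUSES = ("queued", "downloading", "converting", "quantizing")
TERMINAL_STATUSES = frozenset({"completed", "failed", "stopped"})


def is_terminal(status):
    """Return True once a job can no longer make progress."""
    return status in TERMINAL_STATUSES


def stage_index(status):
    """Return the 0-based pipeline stage, or None for a terminal or unknown status."""
    if status in ACTIVE_STATUSES:
        return ACTIVE_STATUSES.index(status)
    return None
```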

Progress Stream

The `GET /api/quantization/jobs/:id/progress` endpoint returns Server-Sent Events (SSE) with JSON payloads:

{
  "job_id": "abc-123",
  "progress_percent": 65.0,
  "status": "quantizing",
  "message": "[ 234/ 567] quantizing blk.5.attn_k.weight ...",
  "output_file": "",
  "extra_metrics": {}
}

When the job completes, `output_file` contains the path to the quantized GGUF file and `extra_metrics` includes `file_size_mb`.
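
To consume the stream outside of `curl`, a client has to split it into events and decode each JSON payload. The sketch below assumes standard SSE framing, where each payload arrives on one or more `data:` lines and a blank line ends the event. The exact framing LocalAI emits is not specified here, and `iter_progress_events` is a hypothetical helper.

```python
import json


def iter_progress_events(lines):
    """Yield one decoded JSON payload per SSE event from an iterable of text lines."""
    data_lines = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # A single space after the colon is part of the framing, not the data.
            if value.startswith(" "):
                value = value[1:]
            data_lines.append(value)
        elif line == "" and data_lines:
            # A blank line terminates the current event.
            yield json.loads("\n".join(data_lines))
            data_lines = []
        # Comment lines (":...") and other SSE fields are ignored.
    if data_lines:
        # Be lenient if the stream ends without a trailing blank line.
        yield json.loads("\n".join(data_lines))
```

The function accepts any iterable of text lines, so it can be fed decoded lines from an HTTP response body and checked against the `status` field to decide when to stop reading.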

Quantization Types

| Type | Size | Quality | Description |
|---|---|---|---|
| `q2_k` | Smallest | Lowest | 2-bit quantization |
| `q3_k_s` | Very small | Low | 3-bit small |
| `q3_k_m` | Very small | Low | 3-bit medium |
| `q3_k_l` | Small | Low-medium | 3-bit large |
| `q4_0` | Small | Medium | 4-bit legacy |
| `q4_k_s` | Small | Medium | 4-bit small |
| `q4_k_m` | Small | Good | 4-bit medium (recommended) |
| `q5_0` | Medium | Good | 5-bit legacy |
| `q5_k_s` | Medium | Good | 5-bit small |
| `q5_k_m` | Medium | Very good | 5-bit medium |
| `q6_k` | Large | Excellent | 6-bit |
| `q8_0` | Large | Near-lossless | 8-bit |
| `f16` | Largest | Lossless | 16-bit (no quantization, GGUF conversion only) |

The UI also supports entering a custom quantization type string for any format supported by `llama-quantize`.
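
To put rough numbers on the Size column, the sketch below estimates output size from bits per weight for `f16` and the legacy block formats, whose ggml layout stores 32 weights per block with one 2-byte scale. It is an approximation under stated assumptions. It ignores GGUF metadata and tensors kept at higher precision, and it leaves out the k-quant types because they mix formats per tensor. The 270 million parameter count is inferred from the example model's name, and `estimate_size_mb` is a hypothetical helper.

```python
# Bits per weight derived from bytes per 32-weight block (f16 is unblocked).
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 34 * 8 / 32,  # 2-byte scale + 32 one-byte values = 8.5 bits
    "q5_0": 22 * 8 / 32,  # 2-byte scale + 4 bytes of high bits + 16 bytes of nibbles = 5.5 bits
    "q4_0": 18 * 8 / 32,  # 2-byte scale + 16 bytes of nibbles = 4.5 bits
}


def estimate_size_mb(n_params, quantization_type):
    """Estimate tensor data size in decimal megabytes (10**6 bytes)."""
    bits = BITS_PER_WEIGHT[quantization_type]
    return n_params * bits / 8 / 1e6


if __name__ == "__main__":
    for qtype in ("f16", "q8_0", "q5_0", "q4_0"):
        print(qtype, round(estimate_size_mb(270_000_000, qtype), 1), "MB")
```

Under these assumptions a 270 million parameter model drops from about 540 MB at `f16` to about 152 MB at `q4_0`. Real output files will differ somewhat.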

Web UI

A "Quantize" page appears in the sidebar under the Tools section. The UI provides:

- Job Configuration: select model, quantization type (dropdown with presets or custom input), backend, and HuggingFace token
- Progress Monitor: real-time progress bar and log output via SSE
- Jobs List: view all quantization jobs with status and stop/delete actions
- Output: download the quantized GGUF file or import it directly into LocalAI for immediate use

Architecture

Quantization uses the same gRPC backend architecture as fine-tuning:

- Proto layer: `QuantizationRequest`, `QuantizationProgress` (streaming), `StopQuantization`
- Python backend: Downloads model, runs `convert_hf_to_gguf.py` and `llama-quantize`
- Go service: Manages job lifecycle, state persistence, async import
- REST API: HTTP endpoints with SSE progress streaming
- React UI: Configuration form, real-time progress monitor, download/import panel
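
The two tools the Python backend runs correspond to the `converting` and `quantizing` stages. The sketch below shows how their command lines could be assembled. It is not LocalAI's backend code: the `build_pipeline_commands` helper, the `python` executable name, and the output file names are assumptions for illustration, while `--outfile`, `--outtype`, and the positional `llama-quantize` arguments follow llama.cpp's own tools.

```python
def build_pipeline_commands(model_dir, work_dir, quantization_type):
    """Return the argv lists for the convert step and, if needed, the quantize step."""
    qtype = quantization_type.lower()
    f16_path = f"{work_dir}/model-f16.gguf"  # hypothetical intermediate file name
    convert = [
        "python", "convert_hf_to_gguf.py", model_dir,
        "--outfile", f16_path,
        "--outtype", "f16",
    ]
    if qtype == "f16":
        # f16 means GGUF conversion only, so no quantize step is needed.
        return [convert]
    out_path = f"{work_dir}/model-{qtype}.gguf"  # hypothetical output file name
    # llama-quantize takes: input file, output file, target type.
    quantize = ["llama-quantize", f16_path, out_path, qtype.upper()]
    return [convert, quantize]
```

Each returned list could be passed to `subprocess.run`. The functions only build the commands, so they can be inspected without llama.cpp installed.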
