Files
LocalAI/docs/content/features/quantization.md
LocalAI [bot] 7e59a5c7c5 docs: architecture & feature diagrams (blueprint style) (#10137)
* docs: add 'how LocalAI works' architecture diagram

Add a blueprint-style architecture diagram: clients -> small core (API,
router, WebUI, agents) -> gRPC -> backend processes pulled on demand as
OCI images. Place it on the overview page and replace the stale external
architecture image on the reference page.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs: add blueprint diagrams across feature, distributed & getting-started docs

Add 24 architecture/flow/comparison diagrams (PNG + HTML source) under
docs/static/images/diagrams/, wired into their docs pages, from an
impact-vs-effort audit of the docs. Broaden the API surface on the
overview architecture diagram (OpenAI, Anthropic, ElevenLabs, Ollama,
and LocalAI's own API) and move the gRPC boundary label clear of the arrows.

Pages: distributed mode (architecture, scheduling, ds4 layer-split),
distributed inferencing, MLX, realtime, quantization, MCP, agents,
mitm & cloud proxy, middleware, reverse-proxy TLS, VRAM, voice & face
recognition, reranker, function calling, fine-tuning (recipe + jobs),
diarization, audio transform, quickstart, model resolution.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs: add composable-core diagram to README hero

Commit the composable-core card (small core + on-demand backend tiles)
alongside the other diagrams and reference it from the README hero via a
repo-relative path, so it renders on GitHub.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* docs: fix composable-core connectors/badge and federated-vs-worker layout

- composable-core: thicken the plug-in connectors so they read clearly, and
  widen the SEPARATE IMAGE badge so its text no longer overflows the box.
- federated-vs-worker: shorten the WHOLE/SPLIT REQUEST pills to fit, and
  replace the tangled node-to-node activation arrows with a clean fan-out
  (request split across all sharded nodes), mirroring the federated panel.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-02 18:43:22 +02:00

5.6 KiB

+++ disableToc = false title = "Model Quantization" weight = 19 url = '/features/quantization/' +++

From an HF model to a quantized GGUF: convert to f16, then quantize, tracked as a job

LocalAI supports model quantization directly through the API and Web UI. Quantization converts HuggingFace models to GGUF format and compresses them to smaller sizes for efficient inference with llama.cpp.

{{% notice note %}} This feature is experimental and may change in future releases. {{% /notice %}}

Supported Backends

Backend Description Quantization Types Platforms
llama-cpp-quantization Downloads HF models, converts to GGUF, and quantizes using llama.cpp q2_k, q3_k_s, q3_k_m, q3_k_l, q4_0, q4_k_s, q4_k_m, q5_0, q5_k_s, q5_k_m, q6_k, q8_0, f16 CPU (Linux, macOS)

Quick Start

1. Start a quantization job

curl -X POST http://localhost:8080/api/quantization/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/functiongemma-270m-it",
    "quantization_type": "q4_k_m"
  }'

2. Monitor progress (SSE stream)

curl -N http://localhost:8080/api/quantization/jobs/{job_id}/progress

3. Download the quantized model

curl -o model.gguf http://localhost:8080/api/quantization/jobs/{job_id}/download

4. Or import it directly into LocalAI

curl -X POST http://localhost:8080/api/quantization/jobs/{job_id}/import \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-quantized-model"
  }'

API Reference

Endpoints

Method Path Description
POST /api/quantization/jobs Start a quantization job
GET /api/quantization/jobs List all jobs
GET /api/quantization/jobs/:id Get job details
POST /api/quantization/jobs/:id/stop Stop a running job
DELETE /api/quantization/jobs/:id Delete a job and its data
GET /api/quantization/jobs/:id/progress SSE progress stream
POST /api/quantization/jobs/:id/import Import quantized model into LocalAI
GET /api/quantization/jobs/:id/download Download quantized GGUF file
GET /api/quantization/backends List available quantization backends

Job Request Fields

Field Type Description
model string HuggingFace model ID or local path (required)
backend string Backend name (default: llama-cpp-quantization)
quantization_type string Quantization format (default: q4_k_m)
extra_options map Backend-specific options (see below)

Extra Options

Key Description
hf_token HuggingFace token for gated models

Import Request Fields

Field Type Description
name string Model name for LocalAI (auto-generated if empty)

Job Status Values

Status Description
queued Job created, waiting to start
downloading Downloading model from HuggingFace
converting Converting model to f16 GGUF
quantizing Running quantization
completed Quantization finished successfully
failed Job failed (check message for details)
stopped Job stopped by user

Progress Stream

The GET /api/quantization/jobs/:id/progress endpoint returns Server-Sent Events (SSE) with JSON payloads:

{
  "job_id": "abc-123",
  "progress_percent": 65.0,
  "status": "quantizing",
  "message": "[ 234/ 567] quantizing blk.5.attn_k.weight ...",
  "output_file": "",
  "extra_metrics": {}
}

When the job completes, output_file contains the path to the quantized GGUF file and extra_metrics includes file_size_mb.

Quantization Types

Type Size Quality Description
q2_k Smallest Lowest 2-bit quantization
q3_k_s Very small Low 3-bit small
q3_k_m Very small Low 3-bit medium
q3_k_l Small Low-medium 3-bit large
q4_0 Small Medium 4-bit legacy
q4_k_s Small Medium 4-bit small
q4_k_m Small Good 4-bit medium (recommended)
q5_0 Medium Good 5-bit legacy
q5_k_s Medium Good 5-bit small
q5_k_m Medium Very good 5-bit medium
q6_k Large Excellent 6-bit
q8_0 Large Near-lossless 8-bit
f16 Largest Lossless 16-bit (no quantization, GGUF conversion only)

The UI also supports entering a custom quantization type string for any format supported by llama-quantize.

Web UI

A "Quantize" page appears in the sidebar under the Tools section. The UI provides:

  1. Job Configuration — Select model, quantization type (dropdown with presets or custom input), backend, and HuggingFace token
  2. Progress Monitor — Real-time progress bar and log output via SSE
  3. Jobs List — View all quantization jobs with status, stop/delete actions
  4. Output — Download the quantized GGUF file or import it directly into LocalAI for immediate use

Architecture

Quantization uses the same gRPC backend architecture as fine-tuning:

  1. Proto layer: QuantizationRequest, QuantizationProgress (streaming), StopQuantization
  2. Python backend: Downloads model, runs convert_hf_to_gguf.py and llama-quantize
  3. Go service: Manages job lifecycle, state persistence, async import
  4. REST API: HTTP endpoints with SSE progress streaming
  5. React UI: Configuration form, real-time progress monitor, download/import panel