+++
disableToc = false
title = "Text Generation (GPT)"
weight = 10
url = "/features/text-generation/"
+++

LocalAI supports text generation with GPT-style models through `llama.cpp` and other backends (such as `rwkv.cpp`). See also the [Model compatibility]({{%relref "reference/compatibility-table" %}}) table for an up-to-date list of the supported model families.

Note:

- You can also specify the model name as part of the OpenAI token.
- If only one model is available, the API will use it for all the requests.

## API Reference

### Chat completions

https://platform.openai.com/docs/api-reference/chat

For example, to generate a chat completion, you can send a POST request to the `/v1/chat/completions` endpoint with the messages in the request body:

```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "messages": [{"role": "user", "content": "Say this is a test!"}],
  "temperature": 0.7
}'
```

Available additional parameters: `top_p`, `top_k`, `max_tokens`.

### Edit completions

https://platform.openai.com/docs/api-reference/edits

To generate an edit completion, you can send a POST request to the `/v1/edits` endpoint with the instruction and input in the request body:

```bash
curl http://localhost:8080/v1/edits -H "Content-Type: application/json" -d '{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "instruction": "rephrase",
  "input": "Black cat jumped out of the window",
  "temperature": 0.7
}'
```

Available additional parameters: `top_p`, `top_k`, `max_tokens`.
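
The edit request above can also be sent from a script. The following is a minimal Python sketch using only the standard library. The helper names `build_edit_request` and `post_json` are hypothetical, and the base URL assumes a LocalAI instance listening on `localhost:8080` as in the curl examples.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: same address as the curl examples


def build_edit_request(model, instruction, input_text, **params):
    """Build the JSON body for POST /v1/edits."""
    body = {"model": model, "instruction": instruction, "input": input_text}
    # Optional sampling parameters such as temperature, top_p, top_k, max_tokens.
    body.update(params)
    return body


def post_json(path, body):
    """POST a JSON body to the given API path and return the decoded JSON reply."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running, `post_json("/v1/edits", build_edit_request("ggml-koala-7b-model-q4_0-r2.bin", "rephrase", "Black cat jumped out of the window", temperature=0.7))` would send the same payload as the curl command. The `post_json` helper is not specific to edits and can carry a chat completion body to `/v1/chat/completions` in the same way.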
### Completions

https://platform.openai.com/docs/api-reference/completions

To generate a completion, you can send a POST request to the `/v1/completions` endpoint with the prompt in the request body:

```bash
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "prompt": "A long time ago in a galaxy far, far away",
  "temperature": 0.7
}'
```

Available additional parameters: `top_p`, `top_k`, `max_tokens`.

### List models

You can list all the models available with:

```bash
curl http://localhost:8080/v1/models
```

### Anthropic Messages API

LocalAI supports the Anthropic Messages API, which is compatible with Claude clients. This endpoint provides a structured way to send messages and receive responses, with support for tools, streaming, and multimodal content.

**Endpoint:** `POST /v1/messages` or `POST /messages`

**Reference:** https://docs.anthropic.com/claude/reference/messages_post

#### Basic Usage

```bash
curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "max_tokens": 1024,
    "messages": [
      {"role": "user", "content": "Say this is a test!"}
    ]
  }'
```

#### Request Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `model` | string | Yes | The model identifier |
| `messages` | array | Yes | Array of message objects with `role` and `content` |
| `max_tokens` | integer | Yes | Maximum number of tokens to generate (must be > 0) |
| `system` | string | No | System message to set the assistant's behavior |
| `temperature` | float | No | Sampling temperature (0.0 to 1.0) |
| `top_p` | float | No | Nucleus sampling parameter |
| `top_k` | integer | No | Top-k sampling parameter |
| `stop_sequences` | array | No | Array of strings that will stop generation |
| `stream` | boolean | No | Enable streaming responses |
| `tools` | array | No | Array of tool definitions for function calling |
| `tool_choice` | string/object | No | Tool choice strategy: "auto", "any", "none", or specific tool |
| `metadata` | object | No | Per-request metadata passed to the backend (e.g., `{"enable_thinking": "true"}`) |

#### Message Format

Messages can contain text or structured content blocks:

```bash
curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "max_tokens": 1024,
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is in this image?"
          },
          {
            "type": "image",
            "source": {
              "type": "base64",
              "media_type": "image/jpeg",
              "data": "base64_encoded_image_data"
            }
          }
        ]
      }
    ]
  }'
```

#### Tool Calling

The Anthropic API supports function calling through tools:

```bash
curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "max_tokens": 1024,
    "tools": [
      {
        "name": "get_weather",
        "description": "Get the current weather",
        "input_schema": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state"
            }
          },
          "required": ["location"]
        }
      }
    ],
    "tool_choice": "auto",
    "messages": [
      {"role": "user", "content": "What is the weather in San Francisco?"}
    ]
  }'
```

#### Streaming

Enable streaming responses by setting `stream: true`:

```bash
curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "max_tokens": 1024,
    "stream": true,
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ]
  }'
```

Streaming responses use Server-Sent Events (SSE) format with event types: `message_start`, `content_block_start`, `content_block_delta`, `content_block_stop`, `message_delta`, and `message_stop`.

#### Response Format

```json
{
  "id": "msg_abc123",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "This is a test!"
    }
  ],
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 10,
    "output_tokens": 5
  }
}
```

### Open Responses API

LocalAI supports the Open Responses API specification, which provides a standardized interface for AI model interactions with support for background processing, streaming, tool calling, and advanced features like reasoning.

**Endpoint:** `POST /v1/responses` or `POST /responses`

**Reference:** https://www.openresponses.org/specification

#### Basic Usage

```bash
curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "input": "Say this is a test!",
    "max_output_tokens": 1024
  }'
```

#### Request Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `model` | string | Yes | The model identifier |
| `input` | string/array | Yes | Input text or array of input items |
| `max_output_tokens` | integer | No | Maximum number of tokens to generate |
| `temperature` | float | No | Sampling temperature |
| `top_p` | float | No | Nucleus sampling parameter |
| `instructions` | string | No | System instructions |
| `tools` | array | No | Array of tool definitions |
| `tool_choice` | string/object | No | Tool choice: "auto", "required", "none", or specific tool |
| `stream` | boolean | No | Enable streaming responses |
| `background` | boolean | No | Run request in background (returns immediately) |
| `store` | boolean | No | Whether to store the response |
| `reasoning` | object | No | Reasoning configuration with `effort` and `summary` |
| `parallel_tool_calls` | boolean | No | Allow parallel tool calls |
| `max_tool_calls` | integer | No | Maximum number of tool calls |
| `presence_penalty` | float | No | Presence penalty (-2.0 to 2.0) |
| `frequency_penalty` | float | No | Frequency penalty (-2.0 to 2.0) |
| `top_logprobs` | integer | No | Number of top logprobs to return |
| `truncation` | string | No | Truncation mode: "auto" or "disabled" |
| `text_format` | object | No | Text format configuration |
| `metadata` | object | No | Custom metadata |

#### Input Format

Input can be a simple string or an array of structured items:

```bash
curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "input": [
      {
        "type": "message",
        "role": "user",
        "content": "What is the weather?"
      }
    ],
    "max_output_tokens": 1024
  }'
```

#### Background Processing

Run requests in the background for long-running tasks:

```bash
curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "input": "Generate a long story",
    "max_output_tokens": 4096,
    "background": true
  }'
```

The response will include a response ID that can be used to poll for completion:

```json
{
  "id": "resp_abc123",
  "object": "response",
  "status": "in_progress",
  "created_at": 1234567890
}
```

#### Retrieving Background Responses

Use the GET endpoint to retrieve background responses:

```bash
# Get response by ID
curl http://localhost:8080/v1/responses/resp_abc123

# Resume streaming with query parameters
curl "http://localhost:8080/v1/responses/resp_abc123?stream=true&starting_after=10"
```

#### Canceling Background Responses

Cancel a background response that's still in progress:

```bash
curl -X POST http://localhost:8080/v1/responses/resp_abc123/cancel
```

#### Tool Calling

Open Responses API supports function calling with tools:

```bash
curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "input": "What is the weather in San Francisco?",
    "tools": [
      {
        "type": "function",
        "name": "get_weather",
        "description": "Get the current weather",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state"
            }
          },
          "required": ["location"]
        }
      }
    ],
    "tool_choice": "auto",
    "max_output_tokens": 1024
  }'
```

#### Reasoning Configuration

Configure reasoning effort and summary style:

```bash
curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "input": "Solve this complex problem step by step",
    "reasoning": {
      "effort": "high",
      "summary": "detailed"
    },
    "max_output_tokens": 2048
  }'
```

#### Response Format

```json
{
  "id": "resp_abc123",
  "object": "response",
  "created_at": 1234567890,
  "completed_at": 1234567895,
  "status": "completed",
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "output": [
    {
      "type": "message",
      "id": "msg_001",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "This is a test!",
          "annotations": [],
          "logprobs": []
        }
      ],
      "status": "completed"
    }
  ],
  "error": null,
  "incomplete_details": null,
  "temperature": 0.7,
  "top_p": 1.0,
  "presence_penalty": 0.0,
  "frequency_penalty": 0.0,
  "usage": {
    "input_tokens": 10,
    "output_tokens": 5,
    "total_tokens": 15,
    "input_tokens_details": {
      "cached_tokens": 0
    },
    "output_tokens_details": {
      "reasoning_tokens": 0
    }
  }
}
```

## Backends

### RWKV

RWKV support is available through llama.cpp (see below).

### llama.cpp

[llama.cpp](https://github.com/ggerganov/llama.cpp) is a popular port of Facebook's LLaMA model in C/C++.

{{% notice note %}}

The `ggml` file format has been deprecated. If you are using `ggml` models and you are configuring your model with a YAML file, use a LocalAI version older than v2.25.0. For `gguf` models, use the `llama` backend. The go backend is deprecated as well but still available as `go-llama`.
{{% /notice %}}

#### Features

The `llama.cpp` backend supports the following features:

- [📖 Text generation (GPT)]({{%relref "features/text-generation" %}})
- [🧠 Embeddings]({{%relref "features/embeddings" %}})
- [🔥 OpenAI functions]({{%relref "features/openai-functions" %}})
- [✍️ Constrained grammars]({{%relref "features/constrained_grammars" %}})

#### Setup

LocalAI supports `llama.cpp` models out of the box. You can use a `llama.cpp` model in the same way as any other model.

##### Manual setup

It is sufficient to copy the `ggml` or `gguf` model files into the `models` folder. You can then refer to the model in the `model` parameter of the API calls.

[You can optionally create an associated YAML]({{%relref "advanced" %}}) model config file to tune the model's parameters or apply a template to the prompt.

Prompt templates are useful for models that are fine-tuned towards a specific prompt.

##### Automatic setup

LocalAI supports model galleries, which are indexes of models. For instance, the huggingface gallery contains a large curated index of models from the huggingface model hub for `ggml` or `gguf` models.

If you have the galleries enabled and LocalAI already running, you can start chatting with models in huggingface by running:

```bash
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin",
  "messages": [{"role": "user", "content": "Say this is a test!"}],
  "temperature": 0.1
}'
```

LocalAI will automatically download and configure the model in the `models` directory.

Models can also be preloaded or downloaded on demand. To learn about model galleries, check out the [model gallery documentation]({{%relref "features/model-gallery" %}}).
#### YAML configuration

To use the `llama.cpp` backend, specify `llama-cpp` as the backend in the YAML file:

```yaml
name: llama
backend: llama-cpp
parameters:
  # Relative to the models path
  model: file.gguf
```

#### Backend Options

The `llama.cpp` backend supports additional configuration options that can be specified in the `options` field of your model YAML configuration. These options allow fine-tuning of the backend behavior:

| Option | Type | Description | Example |
|--------|------|-------------|---------|
| `use_jinja` or `jinja` | boolean | Enable Jinja2 template processing for chat templates. When enabled, the backend uses Jinja2-based chat templates from the model for formatting messages. | `use_jinja:true` |
| `context_shift` | boolean | Enable context shifting, which allows the model to dynamically adjust context window usage. | `context_shift:true` |
| `cache_ram` | integer | Size budget in MiB for the **server-side prompt cache** (a host-RAM store of idle slot KV states that's reloaded on a prompt-prefix hit, see [upstream PR #16391](https://github.com/ggml-org/llama.cpp/pull/16391)). Default: `-1` (no limit). `0` disables the prompt cache entirely. Together with `kv_unified` and `cache_idle_slots` this is what makes a repeated system prompt skip prefill on subsequent calls. | `cache_ram:4096` |
| `parallel` or `n_parallel` | integer | Enable parallel request processing. When set to a value greater than 1, enables continuous batching for handling multiple requests concurrently. | `parallel:4` |
| `grpc_servers` or `rpc_servers` | string | Comma-separated list of gRPC server addresses for distributed inference. Allows distributing workload across multiple llama.cpp workers. | `grpc_servers:localhost:50051,localhost:50052` |
| `fit_params` or `fit` | boolean | Enable auto-adjustment of model/context parameters to fit available device memory. Default: `true`. | `fit_params:true` |
| `fit_params_target` or `fit_target` | integer | Target margin per device in MiB when using fit_params. Default: `1024` (1GB). | `fit_target:2048` |
| `fit_params_min_ctx` or `fit_ctx` | integer | Minimum context size that can be set by fit_params. Default: `4096`. | `fit_ctx:2048` |
| `n_cache_reuse` or `cache_reuse` | integer | Minimum chunk size to attempt reusing from the cache via KV shifting. Default: `0` (disabled). | `cache_reuse:256` |
| `slot_prompt_similarity` or `sps` | float | How much the prompt of a request must match the prompt of a slot to use that slot. Default: `0.1`. Set to `0` to disable. | `sps:0.5` |
| `swa_full` | boolean | Use full-size SWA (Sliding Window Attention) cache. Default: `false`. | `swa_full:true` |
| `cont_batching` or `continuous_batching` | boolean | Enable continuous batching for handling multiple sequences. Default: `true`. | `cont_batching:true` |
| `check_tensors` | boolean | Validate tensor data for invalid values during model loading. Default: `false`. | `check_tensors:true` |
| `warmup` | boolean | Enable warmup run after model loading. Default: `true`. | `warmup:false` |
| `no_op_offload` | boolean | Disable offloading host tensor operations to device. Default: `false`. | `no_op_offload:true` |
| `kv_unified` or `unified_kv` | boolean | Use a single unified KV buffer shared across all sequences. Default: `true` (LocalAI override; upstream defaults to `false` but auto-enables it when slot count is auto). **Required for `cache_idle_slots` to work**: without it the server force-disables idle-slot saving at init, and the prompt cache is never written across requests. | `kv_unified:false` |
| `cache_idle_slots` or `idle_slots_cache` | boolean | On a new task, save the previous slot's KV state into the prompt cache (and clear the slot) so a later request with the same prefix can warm-load it. Default: `true`. Auto-disabled by the server if `kv_unified=false` or `cache_ram=0`. | `cache_idle_slots:false` |
| `n_ctx_checkpoints` or `ctx_checkpoints` | integer | Maximum number of context checkpoints per slot (used for partial-prefix recovery, e.g. SWA). Default: `32`. | `ctx_checkpoints:16` |
| `checkpoint_min_step` or `checkpoint_min_spacing` (aliases: `checkpoint_every_nt`, `checkpoint_every_n_tokens`) | integer | Minimum spacing in tokens between context checkpoints. `0` disables the minimum-spacing gate. Default: `256`. (Renamed upstream from `checkpoint_every_nt`; semantics shifted from a fixed cadence to a minimum spacing.) | `checkpoint_min_step:1024` |
| `split_mode` or `sm` | string | How to split the model across multiple GPUs: `none` (single GPU only), `layer` (default: split layers and KV across GPUs), `row` (split rows across GPUs), `tensor` (experimental tensor parallelism, requires `flash_attention: true`, manually set `context_size`, and a llama.cpp build that includes [#19378](https://github.com/ggml-org/llama.cpp/pull/19378); it historically also required KV-cache quantization to be disabled, but [#23792](https://github.com/ggml-org/llama.cpp/pull/23792) lifts that restriction so `cache_type_k`/`cache_type_v` quantization can be combined with tensor parallelism on builds that include it). | `split_mode:tensor` |

**Example configuration with options:**

```yaml
name: llama-model
backend: llama
parameters:
  model: model.gguf
options:
- use_jinja:true
- context_shift:true
- cache_ram:4096
- parallel:2
- fit_params:true
- fit_target:1024
- slot_prompt_similarity:0.5
```

**Note:** The `parallel` option can also be set via the `LLAMACPP_PARALLEL` environment variable, and `grpc_servers` can be set via the `LLAMACPP_GRPC_SERVERS` environment variable. Options specified in the YAML file take precedence over environment variables.

##### Server-side prompt cache (repeated system prompts)

Agents, coding assistants, and Anthropic/OpenAI-compatible CLIs typically resend the same large system prompt on every turn. The llama.cpp server can short-circuit prefill for the matching prefix by stashing idle slot KV states in host RAM and reloading them on a hit. Three settings interact:

| Setting | Default | Role |
|---|---|---|
| `cache_ram:N` | `-1` (no limit) | Allocates the host-side prompt cache. `0` disables it. |
| `kv_unified:true` | `true` | Single unified KV buffer (**prerequisite** for idle-slot saving). |
| `cache_idle_slots:true` | `true` | Persists the idle slot's KV into the prompt cache on task switch. |

All three are on by default since LocalAI v4.3, so the prompt cache works out of the box for the common single-slot setup. If you're on an older release, or you've explicitly disabled one of them, add the following to recover the behaviour:

```yaml
options:
- cache_ram:4096        # or -1 for no limit
- kv_unified:true
- cache_idle_slots:true
```

Set `cache_ram:0` to opt out of the prompt cache entirely (saves host RAM at the cost of re-prefilling repeated prompts).

#### Reference

- [llama](https://github.com/ggerganov/llama.cpp)

### ik_llama.cpp

[ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp) is a hard fork of `llama.cpp` by Iwan Kawrakow that focuses on superior CPU and hybrid GPU/CPU performance. It ships additional quantization types (IQK quants), custom quantization mixes, Multi-head Latent Attention (MLA) for DeepSeek models, and fine-grained tensor offload controls, which are particularly useful for running very large models on commodity CPU hardware.

{{% notice note %}}

The `ik-llama-cpp` backend requires a CPU with **AVX2** support. The IQK kernels are not compatible with older CPUs.
{{% /notice %}}

#### Features

The `ik-llama-cpp` backend supports the following features:

- [📖 Text generation (GPT)]({{%relref "features/text-generation" %}})
- [🧠 Embeddings]({{%relref "features/embeddings" %}})
- IQK quantization types for better CPU inference performance
- Multimodal models (via clip/llava)

#### Setup

The backend is distributed as a separate container image and can be installed from the LocalAI backend gallery, or specified directly in a model configuration. GGUF models loaded with this backend benefit from ik_llama.cpp's optimized CPU kernels, which is especially useful for MoE models and large quantized models that would otherwise be GPU-bound.

#### YAML configuration

To use the `ik-llama-cpp` backend, specify it as the backend in the YAML file:

```yaml
name: my-model
backend: ik-llama-cpp
parameters:
  # Relative to the models path
  model: file.gguf
```

The aliases `ik-llama` and `ik_llama` are also accepted.

#### Reference

- [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp)

### turboquant (llama.cpp fork with TurboQuant KV-cache)

[llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant) is a `llama.cpp` fork that adds the **TurboQuant KV-cache** quantization scheme. It reuses the upstream `llama.cpp` codebase and ships as a drop-in alternative backend inside LocalAI, sharing the same gRPC server sources as the stock `llama-cpp` backend, so any GGUF model that runs on `llama-cpp` also runs on `turboquant`.

You would pick `turboquant` when you want **smaller KV-cache memory pressure** (longer contexts on the same VRAM) or to experiment with the fork's quantized KV representations on top of the standard `cache_type_k` / `cache_type_v` knobs already supported by upstream `llama.cpp`.

#### Features

- Drop-in GGUF compatibility with upstream `llama.cpp`.
- TurboQuant KV-cache quantization (see fork README for the current set of accepted `cache_type_k` / `cache_type_v` values).
- Same feature surface as the `llama-cpp` backend: text generation, embeddings, tool calls, multimodal via mmproj.
- Available on CPU (AVX/AVX2/AVX512/fallback), NVIDIA CUDA 12/13, AMD ROCm/HIP, Intel SYCL f32/f16, Vulkan, and NVIDIA L4T.

#### Setup

`turboquant` ships as a separate container image in the LocalAI backend gallery. Install it like any other backend:

```bash
local-ai backends install turboquant
```

Or pick a specific flavor for your hardware (example tags: `cpu-turboquant`, `cuda12-turboquant`, `cuda13-turboquant`, `rocm-turboquant`, `intel-sycl-f16-turboquant`, `vulkan-turboquant`).

#### YAML configuration

To run a model with `turboquant`, set the backend in your model YAML and optionally pick quantized KV-cache types:

```yaml
name: my-model
backend: turboquant
parameters:
  # Relative to the models path
  model: file.gguf
# Use TurboQuant's own KV-cache quantization schemes. The fork accepts
# the standard llama.cpp types (f16, f32, q8_0, q4_0, q4_1, q5_0, q5_1)
# and adds three TurboQuant-specific ones: turbo2, turbo3, turbo4.
# turbo3 / turbo4 auto-enable flash_attention (required for turbo K/V)
# and offer progressively more aggressive compression.
cache_type_k: turbo3
cache_type_v: turbo3
context_size: 8192
```

The `cache_type_k` / `cache_type_v` fields map to llama.cpp's `-ctk` / `-ctv` flags. The stock `llama-cpp` backend only accepts the standard llama.cpp types; to use `turbo2` / `turbo3` / `turbo4` you need this `turboquant` backend, which is where the fork's TurboQuant code paths actually take effect. Pick `q8_0` here and you're just running stock llama.cpp KV quantization; pick `turbo*` and you're running TurboQuant.
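
To get a feel for why the KV-cache type matters for memory, the following Python sketch estimates the K+V cache size for the standard ggml types only. It assumes a plain full-attention transformer where K and V use the same cache type, and the model dimensions in the example are hypothetical. The storage cost of the `turbo*` types is not stated here, so they are left out; consult the fork README for those.

```python
# Bytes used to store a block of 32 values in each standard ggml type.
# f16 stores 2 bytes per value; q8_0 and q4_0 add a 2-byte scale per block.
BYTES_PER_32_VALUES = {"f16": 64, "q8_0": 34, "q4_0": 18}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_size, cache_type):
    """Estimate the combined K and V cache size in bytes."""
    # One K vector and one V vector per layer, per KV head, per position.
    values = 2 * n_layers * n_kv_heads * head_dim * context_size
    return values * BYTES_PER_32_VALUES[cache_type] // 32


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 8192-token context.
for cache_type in ("f16", "q8_0", "q4_0"):
    size_mib = kv_cache_bytes(32, 8, 128, 8192, cache_type) / (1024 * 1024)
    print(f"{cache_type}: {size_mib:.0f} MiB")
# f16: 1024 MiB
# q8_0: 544 MiB
# q4_0: 288 MiB
```

The estimate ignores architecture-specific details such as sliding-window attention, so treat it as an upper-level sizing aid, not a measurement.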
#### Reference

- [llama-cpp-turboquant](https://github.com/TheTom/llama-cpp-turboquant)
- [Tracked branch: `feature/turboquant-kv-cache`](https://github.com/TheTom/llama-cpp-turboquant/tree/feature/turboquant-kv-cache)

### dllm (DiffusionGemma block-diffusion)

[dllm.cpp](https://github.com/mudler/dllm.cpp) is a standalone C++/ggml engine for **DiffusionGemma** block-diffusion language models (GGUF weights). Instead of sampling one token at a time, generation works on fixed-size token **canvases** (256 tokens for the published model): each canvas is iteratively denoised with the Entropy-Bound (EB) sampler, committed as a whole block, and committed blocks feed back as prompt for the next canvas.

LocalAI wraps the engine with a native Go backend (`dllm`) that also owns chat templating and output parsing: the model's thought channels and tool calls stream natively as `reasoning_content` and `tool_calls` deltas, with no jinja template involved.

{{% notice note %}}

This backend is **experimental**, and the engine does not yet have a prompt-KV prefix cache: every denoise step recomputes the full prompt+canvas forward pass, so throughput is low (~0.15 tok/s at default settings on a single GB10 GPU) and drops further as the context fills up. The prefix cache is the planned fix in upstream dllm.cpp.
{{% /notice %}}

#### Features

- [📖 Text generation (GPT)]({{%relref "features/text-generation" %}})
- [🔥 OpenAI functions]({{%relref "features/openai-functions" %}}) - tool calls are parsed natively by the backend (gemma4 `<|tool_call>` markers), not by LocalAI's grammar/regex fallback
- Reasoning - opt-in thinking streams as `reasoning_content` (see below)
- Request cancellation - disconnecting the client (or a request timeout) aborts the in-flight generation server-side, so an abandoned slow run does not keep the GPU busy

#### Supported platforms

| Flavor | Hardware |
|---|---|
| `cpu-dllm` | CPU (amd64 + arm64) - functional but very slow on the 26B model; mainly useful for wiring tests |
| `cuda13-dllm` | NVIDIA CUDA 13 (amd64) |
| `cuda13-nvidia-l4t-arm64-dllm` | NVIDIA L4T arm64 (Jetson / DGX Spark GB10) |

macOS/Metal is not available yet.

#### Setup

The easiest path is the model gallery; the entry installs the backend and the model together:

```bash
local-ai models install diffusiongemma-26b-a4b-it
```

Or configure it manually with a YAML file pointing at the GGUF (BF16 is the only published file the engine's validation is calibrated for; the model card flags quantized MoE exports as problematic):

```yaml
name: diffusiongemma
backend: dllm
parameters:
  model: diffusiongemma-26B-A4B-it-BF16.gguf
context_size: 4096
stopwords:
-
# The backend parses tool calls natively; keep LocalAI's generated tool
# grammar from overriding that pipeline.
function:
  grammar:
    disable: true
template:
  use_tokenizer_template: true
```

`use_tokenizer_template: true` is what routes chat requests through the backend's native gemma4 renderer/parser (messages and tools in, `content`/`reasoning_content`/`tool_calls` out). Without it, your own prompt template output is passed to the engine verbatim and the raw model text comes back as plain content.
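
As an illustration of the "messages and tools in, `tool_calls` out" path described above, here is a minimal Python sketch that builds a tool-enabled chat request and reads tool calls back from a reply. It assumes the OpenAI chat completions shapes (a `tools` array of `function` entries in the request, and `choices[0].message.tool_calls` with JSON-encoded `arguments` in the response). The helper names and the `get_weather` tool are hypothetical.

```python
import json

# Hypothetical tool definition in the OpenAI chat completions format.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}


def build_tool_request(model, user_message, tools):
    """Build a /v1/chat/completions body that offers tools to the model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": tools,
    }


def extract_tool_calls(response):
    """Return (name, arguments) pairs from a non-streaming chat completion reply."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # Assumption: arguments arrive as a JSON-encoded string.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls
```

A request built with `build_tool_request("diffusiongemma", "What is the weather in San Francisco?", [WEATHER_TOOL])` can be posted to `/v1/chat/completions`, and the decoded reply passed to `extract_tool_calls`.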
#### Backend options

Model-level generation options go in the `options:` array (format: `key:value`), like other backends:

```yaml
options:
- eb_max_steps:24
- kv_cache:auto
```

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `blocks` | integer | unset | Generation budget in whole diffusion canvases (`blocks * canvas_length` tokens, 256 per canvas for the published model). Must be >= 1. When both `blocks` and a token budget are present, `blocks` wins. |
| `kv_cache` | string | `auto` | One of `auto`, `off`, `on`. The engine has no KV cache yet, so `auto` and `off` are accepted no-ops; `kv_cache:on` fails the request until the prefix-KV cache lands upstream. |
| `eb_max_steps` | integer | 48 | Maximum denoise steps per canvas. Blocks exit early once stable **and** confident, so this is a ceiling, not a fixed cost. Lower values are faster but can degrade quality. |
| `eb_t_min` | float | 0.4 | Lower bound of the linear temperature schedule. |
| `eb_t_max` | float | 0.8 | Upper bound of the linear temperature schedule: `t = t_min + (t_max - t_min) * cur_step/max_steps`, with `cur_step` counting down, so denoising anneals from `t_max` toward `t_min`. |
| `eb_entropy_bound` | float | 0.1 | Per-step acceptance budget: canvas positions are sorted by entropy (ascending) and accepted while the cumulative entropy, minus the position's own, stays at or below the bound. Higher accepts more tokens per step (faster, riskier). |
| `eb_stability_threshold` | integer | 1 | Consecutive identical argmax canvases required before a block counts as stable (`0` = always stable; at `1` the earliest exit is the 2nd identical step). |
| `eb_confidence_threshold` | float | 0.005 | Mean-entropy ceiling for the "confident" half of the early-exit test; a block stops denoising only when it is both stable and below this. |

Defaults for the `eb_*` knobs come from the GGUF's `diffusion.*` metadata when present, falling back to the engine defaults shown (DiffusionGemma's canonical values). The published `diffusiongemma-26B-A4B-it` GGUF carries only `diffusion.canvas_length`, so the fallbacks above are what you actually get.

Per-request parameters: `max_tokens` maps to the engine's `n_predict` (omitted: engine default of 256), and a **positive** `seed` gives deterministic output (absent, zero or negative = a fresh random seed per call). Autoregressive sampling fields (`temperature`, `top_p`, `top_k`, ...) are **not used**: the EB sampler's own temperature schedule (`eb_t_min`/`eb_t_max`) replaces them.

{{% notice note %}}

**`max_tokens` rounds up to whole canvases.** The scheduler always commits whole canvases, so the token budget rounds **up** to `ceil(n_predict / canvas_length)` blocks and the completion may run slightly past the requested `max_tokens` (canonical DiffusionGemma behavior). Generation can still end earlier when the model emits an end-of-turn token, which finalizes the canvas.

{{% /notice %}}

#### Thinking

DiffusionGemma's chat template makes thinking **opt-in** (the default render pre-closes an empty thought channel), so the backend defaults to thinking OFF - the opposite of most reasoning models. Enable it per request via the `metadata` field ([per-request override]({{%relref "advanced/model-configuration#per-request-override-via-metadata" %}})):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "diffusiongemma",
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "metadata": {"enable_thinking": "true"}
  }'
```

The model's thought channel then streams as `reasoning_content`, separate from the final `content`.
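
A client that streams the reply needs to keep the two channels apart. The following Python sketch shows one way to do that. It assumes the request was sent with `"stream": true` and that chunks arrive as OpenAI-style SSE `data:` lines whose `choices[].delta` carries `reasoning_content` or `content`. The function name `split_reasoning` is hypothetical.

```python
import json


def split_reasoning(sse_lines):
    """Accumulate reasoning_content and content from SSE `data:` lines."""
    reasoning, answer = [], []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumption: OpenAI-style end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("reasoning_content"):
                reasoning.append(delta["reasoning_content"])
            if delta.get("content"):
                answer.append(delta["content"])
    return "".join(reasoning), "".join(answer)
```

The function accepts any iterable of text lines, so it can be fed decoded lines from an HTTP response as they arrive or a captured transcript for testing.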
#### Performance expectations Honest numbers from validation on a DGX Spark (GB10, CUDA 13, BF16 26B model, full GPU offload): - Engine load: ~33 s (50 GB of weights to GPU) - Forward pass: ~5.6 s per denoise step (256-token canvas); a block takes up to `eb_max_steps` steps but typically exits early (24/48 observed on a normal prompt, 4 steps on a trivial one) - End-to-end: ~0.15 tok/s at default settings, dominated by the per-step full recompute - this is the cost the upstream prefix-KV cache work targets On CPU the same forward step takes ~139 s (20 Grace cores): treat the CPU flavor as functional, not practical, for the 26B model. **Quantized models.** The Q4_K_M export (16.8 GB vs 50.5 GB BF16) was validated on the same GB10: it loads faster (~12.6 s vs ~32.7 s), quality held up in validation (golden-logits cosine 0.9862, coherent generation on the same prompt as the BF16 run, EB stopper exiting at 19/48 steps, ~0.49 tok/s on that run) - but a forward step takes ~27.5 s, about **5x slower than BF16** (~5.6 s/step) on this hardware. GB10-class GPUs run BF16 natively on tensor cores, while the K-quant MoE weights pay a dequantization cost on every denoise step. Choose Q4_K_M only when you are memory-bound; if BF16 fits, it is both faster and the file the engine's validation tolerances are calibrated for. #### Reference - [dllm.cpp](https://github.com/mudler/dllm.cpp) - [unsloth/diffusiongemma-26B-A4B-it-GGUF](https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF) ### vLLM [vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference. LocalAI has a built-in integration with vLLM, and it can be used to run models. You can check out `vllm` performance [here](https://github.com/vllm-project/vllm#performance). #### Setup Create a YAML file for the model you want to use with `vllm`. 
To set up a model, you only need to specify the model name in the YAML config file:

```yaml
name: vllm
backend: vllm
parameters:
  model: "facebook/opt-125m"
```

The backend will automatically download the required files in order to run the model.

#### Usage

Use the `completions` endpoint by specifying the `vllm` model:

```bash
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "vllm",
  "prompt": "Hello, my name is",
  "temperature": 0.1, "top_p": 0.1
}'
```

#### Passing arbitrary vLLM options with `engine_args`

A subset of `AsyncEngineArgs` is exposed as typed YAML fields (`tensor_parallel_size`, `gpu_memory_utilization`, `quantization`, `max_model_len`, `dtype`, `trust_remote_code`, `enforce_eager`, …). Anything else can be passed through the generic `engine_args:` map. Keys are forwarded verbatim to vLLM's engine; unknown keys fail at load time with the closest valid name as a hint. Nested maps materialise into vLLM's nested config dataclasses (`SpeculativeConfig`, `KVTransferConfig`, `CompilationConfig`, …).

Speculative decoding (DFlash, ngram, eagle, deepseek_mtp, …) is configured this way:

```yaml
name: qwen3.5-4b-dflash
backend: vllm
parameters:
  model: Qwen/Qwen3.5-4B
context_size: 8192
max_model_len: 8192
trust_remote_code: true
quantization: fp8
template:
  use_tokenizer_template: true
engine_args:
  speculative_config:
    method: dflash
    model: z-lab/Qwen3.5-4B-DFlash
    num_speculative_tokens: 15
```

The shape of `speculative_config` follows vLLM's [`SpeculativeConfig`](https://docs.vllm.ai/en/latest/api/vllm/config/speculative.html): `method` picks the algorithm, and the remaining keys are method-specific. Drafters from [z-lab](https://huggingface.co/z-lab) are paired with specific target models; pick the one that matches your target. The drafter loads in its native precision regardless of the target's `quantization:` setting.

Another example is picking a non-default attention backend (e.g.
on hardware where the default cutlass kernels aren't supported):

```yaml
engine_args:
  attention_backend: TRITON_ATTN
```

#### Multi-node data parallelism

`engine_args.data_parallel_size > 1` combined with the `local-ai p2p-worker vllm` follower lets a single model span multiple GPU nodes. See [vLLM Multi-Node (Data-Parallel)]({{% relref "features/distributed-mode#vllm-multi-node-data-parallel" %}}) for the head/follower configuration and a worked Kimi-K2.6 example.

### SGLang

[SGLang](https://github.com/sgl-project/sglang) is a fast serving framework for LLMs and VLMs with a focus on prefix caching, speculative decoding, and multi-modal generation. LocalAI ships a gRPC backend that wraps SGLang's async `Engine`, including its native function-call and reasoning parsers.

#### Setup

```yaml
name: sglang
backend: sglang
parameters:
  model: "Qwen/Qwen3-4B"
template:
  use_tokenizer_template: true
```

The backend will pull the model from HuggingFace on first load.

#### Passing arbitrary SGLang options with `engine_args`

The same `engine_args:` map that the vLLM backend accepts is also honoured by the SGLang backend. Keys are validated against [`ServerArgs`](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py), SGLang's central configuration dataclass, and forwarded verbatim to `Engine(**kwargs)`. Unknown keys fail at load time with the closest valid name as a hint. Unlike vLLM, `ServerArgs` is flat: speculative decoding fields are top-level (`speculative_algorithm`, `speculative_draft_model_path`, etc.) rather than nested under a `speculative_config:` dict.

The typed YAML fields shared with vLLM are mapped to their SGLang equivalents (`gpu_memory_utilization` → `mem_fraction_static`, `enforce_eager` → `disable_cuda_graph`, `tensor_parallel_size` → `tp_size`, `max_model_len` → `context_length`). Anything else, including all speculative-decoding flags, goes under `engine_args:`.
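
The renaming and key validation described above can be pictured with a short Python sketch. It is a hypothetical illustration, not the backend's actual code: `build_engine_kwargs` is an invented helper, `KNOWN_SERVER_ARGS` is a small stand-in for the real `ServerArgs` field list, and letting `engine_args` override a typed field is an assumption, since the precedence is not stated here.

```python
import difflib

# Renames stated in the text above (typed LocalAI YAML field -> SGLang name).
TYPED_TO_SGLANG = {
    "gpu_memory_utilization": "mem_fraction_static",
    "enforce_eager": "disable_cuda_graph",
    "tensor_parallel_size": "tp_size",
    "max_model_len": "context_length",
}

# Hypothetical stand-in for the ServerArgs field names; the real dataclass
# has many more fields than this.
KNOWN_SERVER_ARGS = {
    "mem_fraction_static", "disable_cuda_graph", "tp_size", "context_length",
    "speculative_algorithm", "speculative_draft_model_path",
    "speculative_num_steps", "speculative_num_draft_tokens",
    "speculative_eagle_topk",
}


def build_engine_kwargs(typed_fields, engine_args):
    """Rename typed fields, then layer validated engine_args on top."""
    kwargs = {
        TYPED_TO_SGLANG[name]: value
        for name, value in typed_fields.items()
        if name in TYPED_TO_SGLANG
    }
    for key, value in engine_args.items():
        if key not in KNOWN_SERVER_ARGS:
            # Suggest the closest known field name, if any is similar enough.
            hint = difflib.get_close_matches(key, sorted(KNOWN_SERVER_ARGS), n=1)
            suffix = f" (did you mean {hint[0]!r}?)" if hint else ""
            raise ValueError(f"unknown engine_args key {key!r}{suffix}")
        kwargs[key] = value  # assumption: engine_args wins over a typed field
    return kwargs


kwargs = build_engine_kwargs(
    {"gpu_memory_utilization": 0.85, "tensor_parallel_size": 1},
    {"speculative_algorithm": "NEXTN", "speculative_num_steps": 5},
)
```

Because the result is a single flat dictionary, speculative-decoding keys sit next to the renamed typed fields, mirroring the flat `ServerArgs` layout.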
##### Speculative decoding: Gemma 4 with Multi-Token Prediction

Google publishes paired "assistant" drafters for every Gemma 4 size. The drafters use Multi-Token Prediction (MTP) to propose several candidate tokens per target step, which SGLang then verifies in parallel. Flags below are transcribed verbatim from the [SGLang Gemma 4 cookbook](https://docs.sglang.io/cookbook/autoregressive/Google/Gemma4#speculative-decoding-mtp-server-commands).

For consumer GPUs in the 16–24 GB range, use **E4B** (8 B total / 4 B effective parameters):

```yaml
name: gemma-4-e4b-mtp
backend: sglang
parameters:
  model: google/gemma-4-E4B-it
context_size: 4096
template:
  use_tokenizer_template: true
options:
  - tool_parser:gemma4
  - reasoning_parser:gemma4
engine_args:
  mem_fraction_static: 0.85
  speculative_algorithm: NEXTN
  speculative_draft_model_path: google/gemma-4-E4B-it-assistant
  speculative_num_steps: 5
  speculative_num_draft_tokens: 6
  speculative_eagle_topk: 1
```

For smaller cards (8–12 GB), drop to **E2B** (5 B total / 2 B effective) by swapping the model paths to `google/gemma-4-E2B-it` and `google/gemma-4-E2B-it-assistant`; the rest of the flags stay the same.

`NEXTN` is normalised to `EAGLE` inside `ServerArgs.__post_init__`, so either value works; the cookbook uses `NEXTN`. `mem_fraction_static` is the share of GPU memory SGLang reserves for the model + KV pool; 0.85 is the cookbook's default and adapts to whatever single GPU the backend is running on. The 31 B dense and 26 B-A4B MoE Gemma 4 variants exist in the same cookbook but require `--tp-size 2`, so they're not in the gallery as single-GPU recipes.

> **SGLang version requirement.** Gemma 4 support landed in SGLang via
> [PR #21952](https://github.com/sgl-project/sglang/pull/21952). The
> LocalAI sglang backend pins a release that includes it; if you've
> overridden the pin to an older version, this recipe will fail with a
> "model architecture not recognised" error at load time.
##### Other speculative algorithms

`speculative_algorithm:` also accepts `EAGLE`/`EAGLE3` (paired with an EAGLE-style draft head), `DFLASH` (block-diffusion drafters from [z-lab](https://huggingface.co/z-lab) for the Qwen3 family), `STANDALONE` (a smaller draft LLM proposing tokens that a larger target verifies), and `NGRAM` (no draft model, pure prefix-history speculation). See SGLang's [speculative-decoding docs](https://docs.sglang.io/advanced_features/speculative_decoding.html) for the full algorithm matrix.

#### Tool calling and reasoning parsers

SGLang's native parsers stream `tool_calls` and `reasoning_content` inside `ChatDelta`; the LocalAI Python backend wires them up per-request rather than via `engine_args:`. Pick a parser by name:

```yaml
options:
  - tool_parser:hermes
  - reasoning_parser:deepseek_r1
```

The full list of registered parsers lives in `sglang.srt.function_call` and `sglang.srt.parser.reasoning_parser`.

### Transformers

[Transformers](https://huggingface.co/docs/transformers/index) is a state-of-the-art machine learning library for PyTorch, TensorFlow, and JAX.

LocalAI has a built-in integration with Transformers, and it can be used to run models.

This is an extra backend: the `extra` container images already contain the Python dependencies for Transformers, so no additional setup is needed there.

#### Setup

Create a YAML file for the model you want to use with `transformers`.

To set up a model, you only need to specify the model name in the YAML config file:

```yaml
name: transformers
backend: transformers
parameters:
  model: "facebook/opt-125m"
type: AutoModelForCausalLM
quantization: bnb_4bit # One of: bnb_8bit, bnb_4bit, xpu_4bit, xpu_8bit (optional)
```

The backend will automatically download the required files in order to run the model.

#### Parameters

##### Type

| Type | Description |
| --- | --- |
| `AutoModelForCausalLM` | `AutoModelForCausalLM` is a model that can be used to generate sequences. Use it for NVIDIA CUDA and Intel GPU with Intel Extension for PyTorch acceleration |
| `OVModelForCausalLM` | for Intel CPU/GPU/NPU OpenVINO Text Generation models |
| `OVModelForFeatureExtraction` | for Intel CPU/GPU/NPU OpenVINO Embedding acceleration |
| N/A | Defaults to `AutoModel` |

- `OVModelForCausalLM` requires OpenVINO IR [Text Generation](https://huggingface.co/models?library=openvino&pipeline_tag=text-generation) models from Hugging Face
- `OVModelForFeatureExtraction` works with any Safetensors Transformer [Feature Extraction](https://huggingface.co/models?pipeline_tag=feature-extraction&library=transformers,safetensors) model from Hugging Face (Embedding Model)

Please note that streaming is currently not implemented in `AutoModelForCausalLM` for Intel GPU. AMD GPU support is not implemented. Although AMD CPUs are not officially supported by OpenVINO, there are reports that they work: YMMV.

##### Embeddings

Use `embeddings: true` if the model is an embedding model.

##### Inference device selection

The Transformers backend tries to select the best device for inference automatically; you can override its choice with the `main_gpu` parameter.

| Inference Engine | Applicable Values |
| --- | --- |
| CUDA | `cuda`, `cuda.X` where X is the GPU device like in `nvidia-smi -L` output |
| OpenVINO | Any applicable value from [Inference Modes](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html) like `AUTO`,`CPU`,`GPU`,`NPU`,`MULTI`,`HETERO` |

Example for CUDA: `main_gpu: cuda.0`

Example for OpenVINO: `main_gpu: AUTO:-CPU`

This parameter applies to both Text Generation and Feature Extraction (i.e. Embeddings) models.

##### Inference Precision

The Transformers backend automatically selects the fastest applicable inference precision that the device supports.
On CUDA, you can manually enable *bfloat16*, if your hardware supports it, with the following parameter:

`f16: true`

##### Quantization

| Quantization | Description |
| --- | --- |
| `bnb_8bit` | 8-bit quantization |
| `bnb_4bit` | 4-bit quantization |
| `xpu_8bit` | 8-bit quantization for Intel XPUs |
| `xpu_4bit` | 4-bit quantization for Intel XPUs |

##### Trust Remote Code

Some models, such as Microsoft Phi-3, require external code beyond what the Transformers library provides. Loading such code is disabled by default for security. It can be enabled manually with:

`trust_remote_code: true`

##### Maximum Context Size

The maximum context size, in tokens, can be specified with the `context_size` parameter. Do not use values higher than what your model supports.

Usage example: `context_size: 8192`

##### Auto Prompt Template

The chat template is usually defined by the model author in the `tokenizer_config.json` file. To enable it, set `use_tokenizer_template: true` in the `template` section.

Usage example:

```yaml
template:
  use_tokenizer_template: true
```

##### Custom Stop Words

Stop words are usually defined in the `tokenizer_config.json` file. They can be overridden with the `stopwords` parameter when needed, as with the llama3-Instruct model.
Usage example:

```yaml
stopwords:
- "<|eot_id|>"
- "<|end_of_text|>"
```

#### Usage

Use the `completions` endpoint by specifying the `transformers` model:

```bash
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "transformers",
  "prompt": "Hello, my name is",
  "temperature": 0.1, "top_p": 0.1
}'
```

#### Examples

##### OpenVINO

A model configuration file for OpenVINO and the Starling model:

```yaml
name: starling-openvino
backend: transformers
parameters:
  model: fakezeta/Starling-LM-7B-beta-openvino-int8
context_size: 8192
threads: 6
f16: true
type: OVModelForCausalLM
stopwords:
- <|end_of_turn|>
- <|endoftext|>
prompt_cache_path: "cache"
prompt_cache_all: true
template:
  chat_message: |
    {{if eq .RoleName "system"}}{{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "assistant"}}<|end_of_turn|>GPT4 Correct Assistant: {{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "user"}}GPT4 Correct User: {{.Content}}{{end}}

  chat: |
    {{.Input}}<|end_of_turn|>GPT4 Correct Assistant:

  completion: |
    {{.Input}}
```
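
To see what prompt these templates produce, the three role branches of `chat_message` and the `chat` wrapper can be imitated in plain Python. This is only a rough sketch of the template logic shown above, not LocalAI's template engine: the function names are hypothetical, joining rendered messages with a newline is an assumption, and the trailing newlines that YAML block scalars add are ignored.

```python
def render_message(role, content):
    """Mirror the three `if eq .RoleName` branches of `chat_message`."""
    if role == "system":
        return f"{content}<|end_of_turn|>"
    if role == "assistant":
        return f"<|end_of_turn|>GPT4 Correct Assistant: {content}<|end_of_turn|>"
    if role == "user":
        return f"GPT4 Correct User: {content}"
    return ""  # any other role matches no branch and renders nothing


def render_chat(messages, joiner="\n"):
    """Join the rendered messages and wrap them like the `chat` template."""
    # Assumption: rendered messages are joined with `joiner` and the result
    # is what the `chat` template receives as .Input.
    chat_input = joiner.join(
        render_message(m["role"], m["content"]) for m in messages
    )
    return f"{chat_input}<|end_of_turn|>GPT4 Correct Assistant:"


prompt = render_chat([
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "Hello"},
])
print(prompt)
```

The prompt ends with `GPT4 Correct Assistant:` so that the model continues in the assistant role, and the configured `stopwords` end generation at the next `<|end_of_turn|>`.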