LocalAI/docs/content/features/text-generation.md at efae3fd97b028800c4a66d441d61dfc90bbe2dfa

mirror of https://github.com/mudler/LocalAI.git synced 2026-04-17 05:18:53 -04:00

Ettore Di Giacinto 95efb8a562 feat(backend): add turboquant llama.cpp-fork backend (#9355 )

* feat(backend): add turboquant llama.cpp-fork backend

turboquant is a llama.cpp fork (TheTom/llama-cpp-turboquant, branch
feature/turboquant-kv-cache) that adds a TurboQuant KV-cache scheme.
It ships as a first-class backend reusing backend/cpp/llama-cpp sources
via a thin wrapper Makefile: each variant target copies ../llama-cpp
into a sibling build dir and invokes llama-cpp's build-llama-cpp-grpc-server
with LLAMA_REPO/LLAMA_VERSION overridden to point at the fork. No
duplication of grpc-server.cpp — upstream fixes flow through automatically.

Wires up the full matrix (CPU, CUDA 12/13, L4T, L4T-CUDA13, ROCm, SYCL
f32/f16, Vulkan) in backend.yml and the gallery entries in index.yaml,
adds a tests-turboquant-grpc e2e job driven by BACKEND_TEST_CACHE_TYPE_K/V=q8_0
to exercise the KV-cache config path (backend_test.go gains dedicated env
vars wired into ModelOptions.CacheTypeKey/Value — a generic improvement
usable by any llama.cpp-family backend), and registers a nightly auto-bump
PR in bump_deps.yaml tracking feature/turboquant-kv-cache.

scripts/changed-backends.js gets a special-case so edits to
backend/cpp/llama-cpp/ also retrigger the turboquant CI pipeline, since
the wrapper reuses those sources.

* feat(turboquant): carry upstream patches against fork API drift

turboquant branched from llama.cpp before upstream commit 66060008
("server: respect the ignore eos flag", #21203) which added the
`logit_bias_eog` field to `server_context_meta` and a matching
parameter to `server_task::params_from_json_cmpl`. The shared
backend/cpp/llama-cpp/grpc-server.cpp depends on that field, so
building it against the fork unmodified fails.

Cherry-pick that commit as a patch file under
backend/cpp/turboquant/patches/ and apply it to the cloned fork
sources via a new apply-patches.sh hook called from the wrapper
Makefile. Simplifies the build flow too: instead of hopping through
llama-cpp's build-llama-cpp-grpc-server indirection, the wrapper now
drives the copied Makefile directly (clone -> patch -> build).

Drop the corresponding patch whenever the fork catches up with
upstream — the build fails fast if a patch stops applying, which
is the signal to retire it.

* docs: add turboquant backend section + clarify cache_type_k/v

Document the new turboquant (llama.cpp fork with TurboQuant KV-cache)
backend alongside the existing llama-cpp / ik-llama-cpp sections in
features/text-generation.md: when to pick it, how to install it from
the gallery, and a YAML example showing backend: turboquant together
with cache_type_k / cache_type_v.

Also expand the cache_type_k / cache_type_v table rows in
advanced/model-configuration.md to spell out the accepted llama.cpp
quantization values and note that these fields apply to all
llama.cpp-family backends, not just vLLM.

* feat(turboquant): patch ggml-rpc GGML_OP_COUNT assertion

The fork adds new GGML ops bringing GGML_OP_COUNT to 97, but
ggml/include/ggml-rpc.h static-asserts it equals 96, breaking
the GGML_RPC=ON build paths (turboquant-grpc / turboquant-rpc-server).
Carry a one-line patch that updates the expected count so the
assertion holds. Drop this patch whenever the fork fixes it upstream.

* feat(turboquant): allow turbo* KV-cache types and exercise them in e2e

The shared backend/cpp/llama-cpp/grpc-server.cpp carries its own
allow-list of accepted KV-cache types (kv_cache_types[]) and rejects
anything outside it before the value reaches llama.cpp's parser. That
list only contains the standard llama.cpp types — turbo2/turbo3/turbo4
would throw "Unsupported cache type" at LoadModel time, meaning
nothing the LocalAI gRPC layer accepted was actually fork-specific.

Add a build-time augmentation step (patch-grpc-server.sh, called from
the turboquant wrapper Makefile) that inserts GGML_TYPE_TURBO2_0/3_0/4_0
into the allow-list of the *copied* grpc-server.cpp under
turboquant-<flavor>-build/. The original file under backend/cpp/llama-cpp/
is never touched, so the stock llama-cpp build keeps compiling against
vanilla upstream which has no notion of those enum values.

Switch test-extra-backend-turboquant to set
BACKEND_TEST_CACHE_TYPE_K=turbo3 / _V=turbo3 so the e2e gRPC suite
actually runs the fork's TurboQuant KV-cache code paths (turbo3 also
auto-enables flash_attention in the fork). Picking q8_0 here would
only re-test the standard llama.cpp path that the upstream llama-cpp
backend already covers.

Refresh the docs (text-generation.md + model-configuration.md) to
list turbo2/turbo3/turbo4 explicitly and call out that you only get
the TurboQuant code path with this backend + a turbo* cache type.

* fix(turboquant): rewrite patch-grpc-server.sh in awk, not python3

The builder image (ubuntu:24.04 stage-2 in Dockerfile.turboquant)
does not install python3, so the python-based augmentation step
errored with `python3: command not found` at make time. Switch to
awk, which ships in coreutils and is already available everywhere
the rest of the wrapper Makefile runs.

* Apply suggestion from @mudler

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>

2026-04-15 01:25:04 +02:00

+++
disableToc = false
title = "Text Generation (GPT)"
weight = 10
url = "/features/text-generation/"
+++

LocalAI supports GPT-style text generation with llama.cpp and other backends (such as rwkv.cpp). See the [Model compatibility]({{%relref "reference/compatibility-table" %}}) table for an up-to-date list of the supported model families.

Note:

You can also specify the model name as part of the OpenAI token.
If only one model is available, the API will use it for all the requests.

API Reference

Chat completions

https://platform.openai.com/docs/api-reference/chat

For example, to generate a chat completion, send a POST request to the /v1/chat/completions endpoint with the messages in the request body:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "messages": [{"role": "user", "content": "Say this is a test!"}],
  "temperature": 0.7
}'

Available additional parameters: top_p, top_k, max_tokens
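The same request can be assembled from code. The following is a minimal sketch using only the Python standard library; the helper name `build_chat_request` and the sampling values are illustrative assumptions, not part of LocalAI. It builds the request without sending it, so it runs without a server.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message, **sampling):
    """Build (but do not send) a POST request for /v1/chat/completions.

    `sampling` carries optional parameters such as temperature, top_p,
    top_k and max_tokens. This helper is hypothetical, for illustration.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    payload.update(sampling)
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8080",
    "ggml-koala-7b-model-q4_0-r2.bin",
    "Say this is a test!",
    temperature=0.7,
    top_p=0.9,       # illustrative value
    top_k=40,        # illustrative value
    max_tokens=128,  # illustrative value
)
print(request.full_url)
print(request.data.decode("utf-8"))
```

To send it against a running instance, pass `request` to `urllib.request.urlopen` and decode the JSON body of the reply.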

Edit completions

https://platform.openai.com/docs/api-reference/edits

To generate an edit completion, send a POST request to the /v1/edits endpoint with the instruction and input in the request body:

curl http://localhost:8080/v1/edits -H "Content-Type: application/json" -d '{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "instruction": "rephrase",
  "input": "Black cat jumped out of the window",
  "temperature": 0.7
}'

Available additional parameters: top_p, top_k, max_tokens.

Completions

https://platform.openai.com/docs/api-reference/completions

To generate a completion, send a POST request to the /v1/completions endpoint with the prompt in the request body:

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "prompt": "A long time ago in a galaxy far, far away",
  "temperature": 0.7
}'

Available additional parameters: top_p, top_k, max_tokens

List models

You can list all the models available with:

curl http://localhost:8080/v1/models

Anthropic Messages API

LocalAI supports the Anthropic Messages API, which is compatible with Claude clients. This endpoint provides a structured way to send messages and receive responses, with support for tools, streaming, and multimodal content.

Endpoint: POST /v1/messages or POST /messages

Reference: https://docs.anthropic.com/claude/reference/messages_post

Basic Usage

curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "max_tokens": 1024,
    "messages": [
      {"role": "user", "content": "Say this is a test!"}
    ]
  }'

Request Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `model` | string | Yes | The model identifier |
| `messages` | array | Yes | Array of message objects with `role` and `content` |
| `max_tokens` | integer | Yes | Maximum number of tokens to generate (must be > 0) |
| `system` | string | No | System message to set the assistant's behavior |
| `temperature` | float | No | Sampling temperature (0.0 to 1.0) |
| `top_p` | float | No | Nucleus sampling parameter |
| `top_k` | integer | No | Top-k sampling parameter |
| `stop_sequences` | array | No | Array of strings that will stop generation |
| `stream` | boolean | No | Enable streaming responses |
| `tools` | array | No | Array of tool definitions for function calling |
| `tool_choice` | string/object | No | Tool choice strategy: "auto", "any", "none", or specific tool |
| `metadata` | object | No | Per-request metadata passed to the backend (e.g., `{"enable_thinking": "true"}`) |

Message Format

Messages can contain text or structured content blocks:

curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "max_tokens": 1024,
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is in this image?"
          },
          {
            "type": "image",
            "source": {
              "type": "base64",
              "media_type": "image/jpeg",
              "data": "base64_encoded_image_data"
            }
          }
        ]
      }
    ]
  }'

Tool Calling

The Anthropic API supports function calling through tools:

curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "max_tokens": 1024,
    "tools": [
      {
        "name": "get_weather",
        "description": "Get the current weather",
        "input_schema": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state"
            }
          },
          "required": ["location"]
        }
      }
    ],
    "tool_choice": "auto",
    "messages": [
      {"role": "user", "content": "What is the weather in San Francisco?"}
    ]
  }'

Streaming

Enable streaming responses by setting stream: true:

curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "max_tokens": 1024,
    "stream": true,
    "messages": [
      {"role": "user", "content": "Tell me a story"}
    ]
  }'

Streaming responses use Server-Sent Events (SSE) format with event types: message_start, content_block_start, content_block_delta, content_block_stop, message_delta, and message_stop.
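A client has to split the stream into events and stitch the text deltas back together. The sketch below assumes the payload shape used by the Anthropic Messages API, where a `content_block_delta` event carries `{"delta": {"type": "text_delta", "text": ...}}`; the function names and the sample lines are illustrative, and the exact payloads LocalAI emits should be checked against a live stream.

```python
import json


def iter_sse_events(lines):
    """Yield (event_name, decoded_json) pairs from an iterable of SSE lines."""
    event, data_lines = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield event, json.loads("\n".join(data_lines))
            event, data_lines = None, []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
    if data_lines:  # stream ended without a trailing blank line
        yield event, json.loads("\n".join(data_lines))


def collect_text(lines):
    """Concatenate the text deltas of a streamed message."""
    parts = []
    for event, data in iter_sse_events(lines):
        if event == "content_block_delta":
            delta = data.get("delta", {})
            if delta.get("type") == "text_delta":
                parts.append(delta.get("text", ""))
        elif event == "message_stop":
            break
    return "".join(parts)


# Hand-written sample stream (assumed shape, not captured from a server).
sample = [
    "event: message_start",
    'data: {"type": "message_start"}',
    "",
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Once upon "}}',
    "",
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "a time"}}',
    "",
    "event: message_stop",
    'data: {"type": "message_stop"}',
    "",
]
print(collect_text(sample))
```

In practice `lines` would be the decoded lines of the HTTP response body rather than a list.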

Response Format

{
  "id": "msg_abc123",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "This is a test!"
    }
  ],
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 10,
    "output_tokens": 5
  }
}

Open Responses API

LocalAI supports the Open Responses API specification, which provides a standardized interface for AI model interactions with support for background processing, streaming, tool calling, and advanced features like reasoning.

Endpoint: POST /v1/responses or POST /responses

Reference: https://www.openresponses.org/specification

Basic Usage

curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "input": "Say this is a test!",
    "max_output_tokens": 1024
  }'

Request Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `model` | string | Yes | The model identifier |
| `input` | string/array | Yes | Input text or array of input items |
| `max_output_tokens` | integer | No | Maximum number of tokens to generate |
| `temperature` | float | No | Sampling temperature |
| `top_p` | float | No | Nucleus sampling parameter |
| `instructions` | string | No | System instructions |
| `tools` | array | No | Array of tool definitions |
| `tool_choice` | string/object | No | Tool choice: "auto", "required", "none", or specific tool |
| `stream` | boolean | No | Enable streaming responses |
| `background` | boolean | No | Run request in background (returns immediately) |
| `store` | boolean | No | Whether to store the response |
| `reasoning` | object | No | Reasoning configuration with `effort` and `summary` |
| `parallel_tool_calls` | boolean | No | Allow parallel tool calls |
| `max_tool_calls` | integer | No | Maximum number of tool calls |
| `presence_penalty` | float | No | Presence penalty (-2.0 to 2.0) |
| `frequency_penalty` | float | No | Frequency penalty (-2.0 to 2.0) |
| `top_logprobs` | integer | No | Number of top logprobs to return |
| `truncation` | string | No | Truncation mode: "auto" or "disabled" |
| `text_format` | object | No | Text format configuration |
| `metadata` | object | No | Custom metadata |

Input Format

Input can be a simple string or an array of structured items:

curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "input": [
      {
        "type": "message",
        "role": "user",
        "content": "What is the weather?"
      }
    ],
    "max_output_tokens": 1024
  }'

Background Processing

Run requests in the background for long-running tasks:

curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "input": "Generate a long story",
    "max_output_tokens": 4096,
    "background": true
  }'

The response will include a response ID that can be used to poll for completion:

{
  "id": "resp_abc123",
  "object": "response",
  "status": "in_progress",
  "created_at": 1234567890
}

Retrieving Background Responses

Use the GET endpoint to retrieve background responses:

# Get response by ID
curl http://localhost:8080/v1/responses/resp_abc123

# Resume streaming with query parameters
curl "http://localhost:8080/v1/responses/resp_abc123?stream=true&starting_after=10"
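Polling by ID can be wrapped in a small loop. This is a sketch under stated assumptions: `in_progress` and `completed` are the statuses shown in this section, while `failed` and `cancelled` are assumed terminal states added for illustration. The `fetch` callable is injected so the loop can be exercised without a server; here a stub stands in for the HTTP GET.

```python
import time

# "completed" appears in the examples in this section; "failed" and
# "cancelled" are assumptions for illustration.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_response(fetch, response_id, interval=1.0, max_attempts=60,
                      sleep=time.sleep):
    """Call fetch(response_id) until the response reaches a terminal status."""
    for _ in range(max_attempts):
        response = fetch(response_id)
        if response.get("status") in TERMINAL_STATUSES:
            return response
        sleep(interval)
    raise TimeoutError(
        f"{response_id} still running after {max_attempts} polls"
    )


# Stub standing in for GET /v1/responses/<id>: two pending replies, then done.
_replies = iter([
    {"id": "resp_abc123", "status": "in_progress"},
    {"id": "resp_abc123", "status": "in_progress"},
    {"id": "resp_abc123", "status": "completed"},
])


def fake_fetch(response_id):
    return next(_replies)


result = wait_for_response(fake_fetch, "resp_abc123", interval=0.0)
print(result["status"])
```

Against a real instance, `fetch` would issue the GET request shown above and decode the JSON body.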

Canceling Background Responses

Cancel a background response that's still in progress:

curl -X POST http://localhost:8080/v1/responses/resp_abc123/cancel

Tool Calling

Open Responses API supports function calling with tools:

curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "input": "What is the weather in San Francisco?",
    "tools": [
      {
        "type": "function",
        "name": "get_weather",
        "description": "Get the current weather",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state"
            }
          },
          "required": ["location"]
        }
      }
    ],
    "tool_choice": "auto",
    "max_output_tokens": 1024
  }'

Reasoning Configuration

Configure reasoning effort and summary style:

curl http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-koala-7b-model-q4_0-r2.bin",
    "input": "Solve this complex problem step by step",
    "reasoning": {
      "effort": "high",
      "summary": "detailed"
    },
    "max_output_tokens": 2048
  }'

Response Format

{
  "id": "resp_abc123",
  "object": "response",
  "created_at": 1234567890,
  "completed_at": 1234567895,
  "status": "completed",
  "model": "ggml-koala-7b-model-q4_0-r2.bin",
  "output": [
    {
      "type": "message",
      "id": "msg_001",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "This is a test!",
          "annotations": [],
          "logprobs": []
        }
      ],
      "status": "completed"
    }
  ],
  "error": null,
  "incomplete_details": null,
  "temperature": 0.7,
  "top_p": 1.0,
  "presence_penalty": 0.0,
  "frequency_penalty": 0.0,
  "usage": {
    "input_tokens": 10,
    "output_tokens": 5,
    "total_tokens": 15,
    "input_tokens_details": {
      "cached_tokens": 0
    },
    "output_tokens_details": {
      "reasoning_tokens": 0
    }
  }
}
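Because the generated text is nested inside `output[].content[]`, a client usually flattens it. The helper below is a minimal sketch based only on the response shape shown above; the name `output_text` is a hypothetical helper, not a LocalAI function.

```python
def output_text(response):
    """Join the text of every output_text block in a response object."""
    parts = []
    for item in response.get("output", []):
        if item.get("type") != "message":
            continue  # skip non-message output items
        for block in item.get("content", []):
            if block.get("type") == "output_text":
                parts.append(block.get("text", ""))
    return "".join(parts)


# Trimmed copy of the response shown above.
example_response = {
    "id": "resp_abc123",
    "object": "response",
    "status": "completed",
    "output": [
        {
            "type": "message",
            "id": "msg_001",
            "role": "assistant",
            "content": [
                {"type": "output_text", "text": "This is a test!",
                 "annotations": [], "logprobs": []}
            ],
            "status": "completed",
        }
    ],
    "usage": {"input_tokens": 10, "output_tokens": 5, "total_tokens": 15},
}
print(output_text(example_response))
```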

Backends

RWKV

RWKV support is available through llama.cpp (see below)

llama.cpp

llama.cpp is a popular port of Facebook's LLaMA model in C/C++.

The ggml file format has been deprecated. If you are using ggml models and configuring your model with a YAML file, use a LocalAI version older than v2.25.0. For gguf models, use the llama backend. The go backend is deprecated as well, but is still available as go-llama.

Features

The llama.cpp backend supports the following features:

- [📖 Text generation (GPT)]({{%relref "features/text-generation" %}})
- [🧠 Embeddings]({{%relref "features/embeddings" %}})
- [🔥 OpenAI functions]({{%relref "features/openai-functions" %}})
- [✍️ Constrained grammars]({{%relref "features/constrained_grammars" %}})

Setup

LocalAI supports llama.cpp models out of the box. You can use the llama.cpp model in the same way as any other model.

Manual setup

It is sufficient to copy the ggml or gguf model files into the models folder. You can then refer to the model by its file name in the model parameter of the API calls.

[You can optionally create an associated YAML]({{%relref "advanced" %}}) model config file to tune the model's parameters or apply a template to the prompt.

Prompt templates are useful for models that are fine-tuned towards a specific prompt.

Automatic setup

LocalAI supports model galleries which are indexes of models. For instance, the huggingface gallery contains a large curated index of models from the huggingface model hub for ggml or gguf models.

For instance, if you have the galleries enabled and LocalAI already running, you can just start chatting with models in huggingface by running:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin",
     "messages": [{"role": "user", "content": "Say this is a test!"}],
     "temperature": 0.1
   }'

LocalAI will automatically download and configure the model in the model directory.

Models can also be preloaded or downloaded on demand. To learn about model galleries, check out the [model gallery documentation]({{%relref "features/model-gallery" %}}).

YAML configuration

To use the llama.cpp backend, specify llama-cpp as the backend in the YAML file:

name: llama
backend: llama-cpp
parameters:
  # Relative to the models path
  model: file.gguf

Backend Options

The llama.cpp backend supports additional configuration options that can be specified in the options field of your model YAML configuration. These options allow fine-tuning of the backend behavior:

| Option | Type | Description | Example |
|--------|------|-------------|---------|
| `use_jinja` or `jinja` | boolean | Enable Jinja2 template processing for chat templates. When enabled, the backend uses Jinja2-based chat templates from the model for formatting messages. | `use_jinja:true` |
| `context_shift` | boolean | Enable context shifting, which allows the model to dynamically adjust context window usage. | `context_shift:true` |
| `cache_ram` | integer | Set the maximum RAM cache size in MiB for KV cache. Use `-1` for unlimited (default). | `cache_ram:2048` |
| `parallel` or `n_parallel` | integer | Enable parallel request processing. When set to a value greater than 1, enables continuous batching for handling multiple requests concurrently. | `parallel:4` |
| `grpc_servers` or `rpc_servers` | string | Comma-separated list of gRPC server addresses for distributed inference. Allows distributing workload across multiple llama.cpp workers. | `grpc_servers:localhost:50051,localhost:50052` |
| `fit_params` or `fit` | boolean | Enable auto-adjustment of model/context parameters to fit available device memory. Default: `true`. | `fit_params:true` |
| `fit_params_target` or `fit_target` | integer | Target margin per device in MiB when using fit_params. Default: `1024` (1GB). | `fit_target:2048` |
| `fit_params_min_ctx` or `fit_ctx` | integer | Minimum context size that can be set by fit_params. Default: `4096`. | `fit_ctx:2048` |
| `n_cache_reuse` or `cache_reuse` | integer | Minimum chunk size to attempt reusing from the cache via KV shifting. Default: `0` (disabled). | `cache_reuse:256` |
| `slot_prompt_similarity` or `sps` | float | How much the prompt of a request must match the prompt of a slot to use that slot. Default: `0.1`. Set to `0` to disable. | `sps:0.5` |
| `swa_full` | boolean | Use full-size SWA (Sliding Window Attention) cache. Default: `false`. | `swa_full:true` |
| `cont_batching` or `continuous_batching` | boolean | Enable continuous batching for handling multiple sequences. Default: `true`. | `cont_batching:true` |
| `check_tensors` | boolean | Validate tensor data for invalid values during model loading. Default: `false`. | `check_tensors:true` |
| `warmup` | boolean | Enable warmup run after model loading. Default: `true`. | `warmup:false` |
| `no_op_offload` | boolean | Disable offloading host tensor operations to device. Default: `false`. | `no_op_offload:true` |
| `kv_unified` or `unified_kv` | boolean | Enable unified KV cache. Default: `false`. | `kv_unified:true` |
| `n_ctx_checkpoints` or `ctx_checkpoints` | integer | Maximum number of context checkpoints per slot. Default: `8`. | `ctx_checkpoints:4` |

Example configuration with options:

name: llama-model
backend: llama
parameters:
  model: model.gguf
options:
  - use_jinja:true
  - context_shift:true
  - cache_ram:4096
  - parallel:2
  - fit_params:true
  - fit_target:1024
  - slot_prompt_similarity:0.5

Note: The parallel option can also be set via the LLAMACPP_PARALLEL environment variable, and grpc_servers can be set via the LLAMACPP_GRPC_SERVERS environment variable. Options specified in the YAML file take precedence over environment variables.
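Each entry in the options list is a single `key:value` string, with booleans written as lowercase `true` / `false`, as in the example configuration above. When model YAML files are generated from a script, a helper along these lines can produce those strings. It is a sketch: `format_options` is a hypothetical helper, not a LocalAI API.

```python
def format_options(options):
    """Turn a dict of backend options into `key:value` strings for YAML."""
    formatted = []
    for key, value in options.items():
        if isinstance(value, bool):
            # The examples in this section spell booleans in lowercase.
            value = "true" if value else "false"
        formatted.append(f"{key}:{value}")
    return formatted


entries = format_options({
    "use_jinja": True,
    "cache_ram": 4096,
    "parallel": 2,
    "slot_prompt_similarity": 0.5,
    "warmup": False,
})

# Print the entries as they would appear under `options:` in the model YAML.
print("options:")
for entry in entries:
    print(f"  - {entry}")
```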

Reference

llama

ik_llama.cpp

ik_llama.cpp is a hard fork of llama.cpp by Iwan Kawrakow that focuses on superior CPU and hybrid GPU/CPU performance. It ships additional quantization types (IQK quants), custom quantization mixes, Multi-head Latent Attention (MLA) for DeepSeek models, and fine-grained tensor offload controls — particularly useful for running very large models on commodity CPU hardware.

The ik-llama-cpp backend requires a CPU with AVX2 support. The IQK kernels are not compatible with older CPUs.

Features

The ik-llama-cpp backend supports the following features:

- [📖 Text generation (GPT)]({{%relref "features/text-generation" %}})
- [🧠 Embeddings]({{%relref "features/embeddings" %}})
- IQK quantization types for better CPU inference performance
- Multimodal models (via clip/llava)

Setup

The backend is distributed as a separate container image and can be installed from the LocalAI backend gallery, or specified directly in a model configuration. GGUF models loaded with this backend benefit from ik_llama.cpp's optimized CPU kernels — especially useful for MoE models and large quantized models that would otherwise be GPU-bound.

YAML configuration

To use the ik-llama-cpp backend, specify it as the backend in the YAML file:

name: my-model
backend: ik-llama-cpp
parameters:
  # Relative to the models path
  model: file.gguf

The aliases ik-llama and ik_llama are also accepted.

Reference

ik_llama.cpp

turboquant (llama.cpp fork with TurboQuant KV-cache)

llama-cpp-turboquant is a llama.cpp fork that adds the TurboQuant KV-cache quantization scheme. It reuses the upstream llama.cpp codebase and ships as a drop-in alternative backend inside LocalAI, sharing the same gRPC server sources as the stock llama-cpp backend — so any GGUF model that runs on llama-cpp also runs on turboquant.

You would pick turboquant when you want smaller KV-cache memory pressure (longer contexts on the same VRAM) or to experiment with the fork's quantized KV representations on top of the standard cache_type_k / cache_type_v knobs already supported by upstream llama.cpp.
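To get a feel for the memory at stake, the KV-cache footprint can be estimated with the usual dense-attention formula: one K and one V vector per layer, per KV head, per token. The sketch below uses the block layouts of the standard llama.cpp types (for example, q8_0 stores 32 values in 34 bytes). The model dimensions are hypothetical, and the turbo* types are left out because their storage cost is not stated here; the estimate also ignores any backend-specific overhead.

```python
# Average bytes per stored element for standard llama.cpp cache types.
# Quantized types pack 32 elements per block together with scale metadata.
BYTES_PER_ELEMENT = {
    "f32": 4.0,
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q5_1": 24 / 32,
    "q5_0": 22 / 32,
    "q4_1": 20 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_size,
                   type_k="f16", type_v="f16"):
    """Estimate KV-cache size in bytes for a dense-attention model."""
    elements = n_layers * n_kv_heads * head_dim * context_size
    per_element = BYTES_PER_ELEMENT[type_k] + BYTES_PER_ELEMENT[type_v]
    return int(elements * per_element)


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, 8192 tokens.
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(32, 8, 128, 8192, cache_type, cache_type)
    print(f"{cache_type}: {size / 2**20:.0f} MiB")
```

For this hypothetical model, f16 comes to exactly 1 GiB at 8192 tokens and q8_0 to a little over half of that.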

Features

- Drop-in GGUF compatibility with upstream llama.cpp.
- TurboQuant KV-cache quantization (see fork README for the current set of accepted cache_type_k / cache_type_v values).
- Same feature surface as the llama-cpp backend: text generation, embeddings, tool calls, multimodal via mmproj.
- Available on CPU (AVX/AVX2/AVX512/fallback), NVIDIA CUDA 12/13, AMD ROCm/HIP, Intel SYCL f32/f16, Vulkan, and NVIDIA L4T.

Setup

turboquant ships as a separate container image in the LocalAI backend gallery. Install it like any other backend:

local-ai backends install turboquant

Or pick a specific flavor for your hardware (example tags: cpu-turboquant, cuda12-turboquant, cuda13-turboquant, rocm-turboquant, intel-sycl-f16-turboquant, vulkan-turboquant).

YAML configuration

To run a model with turboquant, set the backend in your model YAML and optionally pick quantized KV-cache types:

name: my-model
backend: turboquant
parameters:
  # Relative to the models path
  model: file.gguf
# Use TurboQuant's own KV-cache quantization schemes. The fork accepts
# the standard llama.cpp types (f16, f32, q8_0, q4_0, q4_1, q5_0, q5_1)
# and adds three TurboQuant-specific ones: turbo2, turbo3, turbo4.
# turbo3 / turbo4 auto-enable flash_attention (required for turbo K/V)
# and offer progressively more aggressive compression.
cache_type_k: turbo3
cache_type_v: turbo3
context_size: 8192

The cache_type_k / cache_type_v fields map to llama.cpp's -ctk / -ctv flags. The stock llama-cpp backend only accepts the standard llama.cpp types — to use turbo2 / turbo3 / turbo4 you need this turboquant backend, which is where the fork's TurboQuant code paths actually take effect. Pick q8_0 here and you're just running stock llama.cpp KV quantization; pick turbo* and you're running TurboQuant.
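The rule above can be expressed as a small pre-flight check for tooling that generates model configs. This is a sketch, not the backend's actual validation: the type lists are copied from the YAML comment above, and `classify_cache_type` is a hypothetical helper.

```python
# Type lists as given in the YAML example above.
STANDARD_CACHE_TYPES = {"f16", "f32", "q8_0", "q4_0", "q4_1", "q5_0", "q5_1"}
TURBOQUANT_CACHE_TYPES = {"turbo2", "turbo3", "turbo4"}


def classify_cache_type(backend, cache_type):
    """Return which KV-cache code path a backend/cache-type pair selects.

    Raises ValueError for a turbo* type on a backend other than turboquant,
    or for a type that is in neither list.
    """
    if cache_type in STANDARD_CACHE_TYPES:
        return "standard"
    if cache_type in TURBOQUANT_CACHE_TYPES:
        if backend != "turboquant":
            raise ValueError(
                f"{cache_type} requires backend: turboquant, got {backend}"
            )
        return "turboquant"
    raise ValueError(f"unknown cache type: {cache_type}")


for backend, cache_type in [
    ("llama-cpp", "q8_0"),
    ("turboquant", "q8_0"),
    ("turboquant", "turbo3"),
]:
    print(backend, cache_type, "->", classify_cache_type(backend, cache_type))
```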

Reference

vLLM

vLLM is a fast and easy-to-use library for LLM inference.

LocalAI has a built-in integration with vLLM, and it can be used to run models. You can check out vllm performance here.

Setup

Create a YAML file for the model you want to use with vLLM.

To set up a model, just specify the model name in the YAML config file:

name: vllm
backend: vllm
parameters:
    model: "facebook/opt-125m"

The backend will automatically download the required files in order to run the model.

Usage

Use the completions endpoint by specifying the vllm backend:

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{   
   "model": "vllm",
   "prompt": "Hello, my name is",
   "temperature": 0.1, "top_p": 0.1
 }'

Transformers

Transformers is a State-of-the-art Machine Learning library for PyTorch, TensorFlow, and JAX.

LocalAI has a built-in integration with Transformers, and it can be used to run models.

This is an extra backend. It is already available in the container images (the extra images already contain the Python dependencies for Transformers), so no additional setup is needed.

Setup

Create a YAML file for the model you want to use with transformers.

To set up a model, just specify the model name in the YAML config file:

name: transformers
backend: transformers
parameters:
    model: "facebook/opt-125m"
type: AutoModelForCausalLM
quantization: bnb_4bit # One of: bnb_8bit, bnb_4bit, xpu_4bit, xpu_8bit (optional)

The backend will automatically download the required files in order to run the model.

Parameters

Type

| Type | Description |
|------|-------------|
| `AutoModelForCausalLM` | `AutoModelForCausalLM` is a model that can be used to generate sequences. Use it for NVIDIA CUDA and Intel GPU with Intel Extensions for PyTorch acceleration |
| `OVModelForCausalLM` | for Intel CPU/GPU/NPU OpenVINO Text Generation models |
| `OVModelForFeatureExtraction` | for Intel CPU/GPU/NPU OpenVINO Embedding acceleration |
| N/A | Defaults to `AutoModel` |

- OVModelForCausalLM requires OpenVINO IR Text Generation models from Hugging Face.
- OVModelForFeatureExtraction works with any Safetensors Transformer Feature Extraction model from Hugging Face (embedding model).

Please note that streaming is currently not implemented in AutoModelForCausalLM for Intel GPU. AMD GPU support is not implemented. Although AMD CPUs are not officially supported by OpenVINO, there are reports that they work: YMMV.

Embeddings

Use embeddings: true if the model is an embedding model

Inference device selection

The Transformers backend tries to select the best device for inference automatically. You can override this choice manually with the main_gpu parameter.

| Inference Engine | Applicable Values |
|------------------|-------------------|
| CUDA | `cuda`, `cuda.X` where X is the GPU device like in `nvidia-smi -L` output |
| OpenVINO | Any applicable value from Inference Modes like `AUTO`,`CPU`,`GPU`,`NPU`,`MULTI`,`HETERO` |

Example for CUDA: main_gpu: cuda.0

Example for OpenVINO: main_gpu: AUTO:-CPU

This parameter applies to both Text Generation and Feature Extraction (i.e. Embeddings) models.

Inference Precision

The Transformers backend automatically selects the fastest applicable inference precision according to what the device supports. On CUDA, you can manually enable bfloat16, if your hardware supports it, with the following parameter:

f16: true

Quantization

| Quantization | Description |
|--------------|-------------|
| `bnb_8bit` | 8-bit quantization |
| `bnb_4bit` | 4-bit quantization |
| `xpu_8bit` | 8-bit quantization for Intel XPUs |
| `xpu_4bit` | 4-bit quantization for Intel XPUs |

Trust Remote Code

Some models, such as Microsoft Phi-3, require external code beyond what the Transformers library provides. Loading such code is disabled by default for security reasons. It can be manually enabled with: trust_remote_code: true

Maximum Context Size

The maximum context size, in tokens, can be specified with the context_size parameter. Do not use values higher than what your model supports.

Usage example: context_size: 8192

Auto Prompt Template

The chat template is usually defined by the model author in the tokenizer_config.json file. To enable it, use the use_tokenizer_template: true parameter in the template section.

Usage example:

template:
  use_tokenizer_template: true

Custom Stop Words

Stop words are usually defined in the tokenizer_config.json file. They can be overridden with the stopwords parameter when needed, as for the llama3-Instruct model.

Usage example:

stopwords:
- "<|eot_id|>"
- "<|end_of_text|>"

Usage

Use the completions endpoint by specifying the transformers model:

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{   
   "model": "transformers",
   "prompt": "Hello, my name is",
   "temperature": 0.1, "top_p": 0.1
 }'

Examples

OpenVINO

A model configuration file for OpenVINO and the Starling model:

name: starling-openvino
backend: transformers
parameters:
  model: fakezeta/Starling-LM-7B-beta-openvino-int8
context_size: 8192
threads: 6
f16: true
type: OVModelForCausalLM
stopwords:
- <|end_of_turn|>
- <|endoftext|>
prompt_cache_path: "cache"
prompt_cache_all: true
template:
  chat_message: |
    {{if eq .RoleName "system"}}{{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "assistant"}}<|end_of_turn|>GPT4 Correct Assistant: {{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "user"}}GPT4 Correct User: {{.Content}}{{end}}

  chat: |
    {{.Input}}<|end_of_turn|>GPT4 Correct Assistant:

  completion: |
    {{.Input}}
