* feat(vllm): build vllm from source for Intel XPU
Upstream publishes no XPU wheels for vllm. The Intel profile was
silently picking up a non-XPU wheel that imported but errored at
engine init, and several runtime deps (pillow, charset-normalizer,
chardet) were missing on Intel -- backend.py crashed at import time
before the gRPC server came up.
Switch the Intel profile to upstream's documented from-source
procedure (docs/getting_started/installation/gpu.xpu.inc.md in
vllm-project/vllm):
- Bump portable Python to 3.12 -- vllm-xpu-kernels ships only a
cp312 wheel.
- Source /opt/intel/oneapi/setvars.sh so vllm's CMake build sees
the dpcpp/sycl compiler from the oneapi-basekit base image.
- Hide requirements-intel-after.txt during installRequirements
(it used to 'pip install vllm'); install vllm's deps from a
fresh git clone of vllm via 'uv pip install -r
requirements/xpu.txt', swap stock triton for
triton-xpu==3.7.0, then 'VLLM_TARGET_DEVICE=xpu uv pip install
--no-deps .' (condensed below).
- requirements-intel.txt trimmed to LocalAI's direct deps
(accelerate / transformers / bitsandbytes); torch-xpu, vllm,
vllm_xpu_kernels and the rest come from upstream's xpu.txt
during the source build.
- requirements.txt: add pillow + charset-normalizer + chardet --
used by backend.py and missing on the Intel install profile.
- run.sh: 'set -x' so backend startup is visible in container
logs (the gRPC startup error path was previously opaque).
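Condensed, the from-source flow described above is roughly (the
exact steps live in the backend's install script):

    source /opt/intel/oneapi/setvars.sh
    git clone https://github.com/vllm-project/vllm.git && cd vllm
    uv pip install -r requirements/xpu.txt
    uv pip uninstall triton && uv pip install triton-xpu==3.7.0
    VLLM_TARGET_DEVICE=xpu uv pip install --no-deps .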
Also adds a one-line docs example for engine_args.attention_backend
under the vLLM section, since older XE-HPG GPUs (e.g. Arc A770)
need TRITON_ATTN to bypass the cutlass path in vllm_xpu_kernels.
Tested end-to-end on an Intel Arc A770 with Qwen2.5-0.5B-Instruct
via LocalAI's /v1/chat/completions.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(vllm): add multi-node data-parallel follower worker
vLLM v1's multi-node story is one process per node sharing a DP
coordinator over ZMQ -- the head runs the API server with
data_parallel_size > 1 and followers run `vllm serve --headless ...`
with matching topology. Today LocalAI can already configure DP on the
head via the engine_args YAML map, but there's no way to bring up the
follower nodes -- so the head sits waiting for ranks that never
handshake.
Add `local-ai p2p-worker vllm`, mirroring MLXDistributed's structural
precedent (operator-launched, static config, no NATS placement). The
worker:
- Optionally self-registers with the frontend as an agent-type node
tagged `node.role=vllm-follower` so it's visible in the admin UI
and operators can scope ordinary models away via inverse
selectors.
- Resolves the platform-specific vllm backend via the gallery's
"vllm" meta-entry (cuda*, intel-vllm, rocm-vllm, ...).
- Runs vLLM as a child process so the heartbeat goroutine survives
until vLLM exits; forwards SIGINT/SIGTERM so vLLM can clean up its
ZMQ sockets before we tear down.
- Rejects --headless combined with --start-rank 0 (rank 0 is the
head and must serve the API); see the sketch below.
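A rank-1 follower then looks roughly like this (only the flags
named above; topology and head-address flags elided):

    local-ai p2p-worker vllm --headless --start-rank 1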
Backend run.sh dispatches `serve` as the first arg to vllm's own CLI
instead of LocalAI's backend.py gRPC server -- the follower speaks
ZMQ directly to the head, there is no LocalAI gRPC on the follower
side. Single-node usage is unchanged.
Generalises the gallery resolution helper into findBackendPath()
shared by MLX and vLLM workers; extracts ParseNodeLabels for the
comma-separated label parsing both use.
Ships with two compose recipes (`docker-compose.vllm-multinode.yaml`
for NVIDIA, `docker-compose.vllm-multinode.intel.yaml` for Intel
XPU/xccl) plus `tests/e2e/vllm-multinode/smoke.sh`. Both vendors are
supported (NCCL for CUDA/ROCm, xccl for XPU) but mixed-vendor DP is
not -- PyTorch's process group requires every rank to use the same
collective backend, and NCCL/xccl/gloo don't interoperate.
Out of scope (deferred): SmartRouter-driven placement of follower
ranks via NATS backend.install events, follower log streaming through
/api/backend-logs, tensor-parallel across nodes, disaggregated
prefill via KVTransferConfig.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* test(vllm): CPU-only end-to-end test for multi-node DP
Adds tests/e2e/vllm-multinode/, a Ginkgo + testcontainers-go suite
that brings up a head + headless follower from the locally-built
local-ai:tests image, bind-mounts the cpu-vllm backend extracted by
make extract-backend-vllm so it's seen as a system backend (no gallery
fetch, no registry server), and asserts a chat completion across both
DP ranks. New `make test-e2e-vllm-multinode` target wires the docker
build, backend extract, and ginkgo run together; BuildKit caches both
images so re-runs only rebuild what changed. Tagged Label("VLLMMultinode")
so the existing distributed suite isn't pulled along.
Two pre-existing bugs surfaced by the test:
1. extract-backend-% (Makefile) failed for every backend, because all
backend images end with `FROM scratch` and `docker create` rejects
an image with no CMD/ENTRYPOINT. Fixed by passing
--entrypoint=/run.sh -- the container is never started, only
docker-cp'd, so the path doesn't have to exist; we just need
anything that satisfies the daemon's create-time validation.
2. backend/python/vllm/run.sh's `serve` shortcut for the multi-node DP
follower exec'd ${EDIR}/venv/bin/vllm directly, but uv bakes an
absolute build-time shebang (`#!/vllm/venv/bin/python3`) that no
longer resolves once the backend is relocated to BackendsPath.
_makeVenvPortable's shebang rewriter only matches paths that
already point at ${EDIR}, so the original shebang slips through
unchanged. Fixed by exec-ing ${EDIR}/venv/bin/python with the script
as an argument -- Python ignores the script's shebang in that case.
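Sketches of both fixes (variable names illustrative; see the diff
for the exact context):

    # 1) Makefile extract-backend-%: --entrypoint only placates the
    #    daemon's create-time validation; the container never starts.
    docker create --entrypoint=/run.sh --name "$$ctr" "$$image"
    docker cp "$$ctr":/ "backend-images/$*"
    docker rm "$$ctr"

    # 2) backend/python/vllm/run.sh serve path: python ignores the
    #    baked-in shebang when the script is passed as an argument.
    exec "${EDIR}/venv/bin/python" "${EDIR}/venv/bin/vllm" "$@"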
The test fixture caps memory aggressively (max_model_len=512,
VLLM_CPU_KVCACHE_SPACE=1, TORCH_COMPILE_DISABLE=1) so two CPU engines
fit on a 32 GB box. TORCH_COMPILE_DISABLE is currently mandatory for
cpu-vllm: torch._inductor's CPU-ISA probe runs even with
enforce_eager=True and needs g++ on PATH, which the LocalAI runtime
image doesn't ship -- to be addressed in a follow-up that bundles a
toolchain in the cpu-vllm backend.
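The caps, expressed as compose-style environment (placement
illustrative; max_model_len lives in the model config, not env):

    environment:
      - VLLM_CPU_KVCACHE_SPACE=1
      - TORCH_COMPILE_DISABLE=1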
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* feat(vllm): bundle a g++ toolchain in the cpu-vllm backend image
torch._inductor's CPU-ISA probe (`cpu_model_runner.py:65 "Warming up
model for the compilation"`) shells out to `g++` at vllm engine
startup, regardless of `enforce_eager=True` -- the eager flag only
disables CUDA graphs, not inductor's first-batch warmup. The LocalAI
CPU runtime image (Dockerfile, unconditional apt list) does not ship
build-essential, and the cpu-vllm backend image is `FROM scratch`,
so any non-trivial inference on cpu-vllm crashes with:
torch._inductor.exc.InductorError:
InvalidCxxCompiler: No working C++ compiler found in
torch._inductor.config.cpp.cxx: (None, 'g++')
Bundling the toolchain in the CPU runtime image would bloat every
non-vllm-CPU deployment and force a single GCC version on backends
that may want clang or a different version. So this lives in the
backend, gated to BUILD_TYPE=='' (the CPU profile).
`package.sh` snapshots g++ + binutils + cc1plus + libstdc++ + libc6
(runtime + dev) + the math libs cc1plus links (libisl/libmpc/libmpfr/
libjansson) into ${BACKEND}/toolchain/, mirroring /usr/... layout. The
unversioned binaries on Debian/Ubuntu are symlink chains pointing into
multiarch packages (`g++` -> `g++-13` -> `x86_64-linux-gnu-g++-13`,
the latter in `g++-13-x86-64-linux-gnu`), so the package list resolves
both the version and the arch-triplet variant. Symlinks /lib ->
usr/lib and /lib64 -> usr/lib64 are recreated under the toolchain
root because Ubuntu's UsrMerge keeps them at /, and ld scripts
(`libc.so`, `libm.so`) hardcode `/lib/...` paths that --sysroot
re-roots into the toolchain.
The unversioned `g++`/`gcc`/`cpp` symlinks are replaced with wrapper
shell scripts that resolve their own location at runtime and pass
`--sysroot=<toolchain>` and `-B <toolchain>/usr/lib/gcc/<triplet>/<ver>/`
to the underlying versioned binary. That's how torch's bare `g++ foo.cpp
-o foo` invocation finds cc1plus (-B), system headers (--sysroot), and
the bundled libstdc++ (--sysroot again; the sysroot also applies to the
linker's search paths).
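A wrapper reduces to roughly (triplet/version as on the build image):

    #!/bin/sh
    # resolve the toolchain root relative to this wrapper (in usr/bin)
    root="$(cd "$(dirname "$0")/../.." && pwd)"
    exec "$root/usr/bin/x86_64-linux-gnu-g++-13" \
        --sysroot="$root" \
        -B "$root/usr/lib/gcc/x86_64-linux-gnu/13/" "$@"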
`run.sh` adds the toolchain bin dir to PATH and the toolchain's
shared-lib dir to LD_LIBRARY_PATH -- everything else (header search,
linker search, executable search) is encapsulated in the wrappers.
No-op for non-CPU builds, the dir doesn't exist there.
The cpu-vllm image grows by ~217 MB. Tradeoff is acceptable -- cpu-vllm
is already a niche profile (few users compared to GPU vllm) and the
alternative is a backend that crashes at first inference unless the
operator manually sets TORCH_COMPILE_DISABLE=1, which silently disables
all torch.compile optimizations.
Drops `TORCH_COMPILE_DISABLE=1` from tests/e2e/vllm-multinode -- the
smoke now exercises the real compile path through the bundled toolchain.
Test runtime is +20s for the warmup compile, still <90s end to end.
Assisted-by: Claude:claude-opus-4-7 [Claude Code]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
* fix(vllm): scope jetson-ai-lab index to L4T-specific wheels via pyproject.toml
The L4T arm64 build resolves dependencies through pypi.jetson-ai-lab.io,
which hosts the L4T-specific torch / vllm / flash-attn wheels but also
transparently proxies the rest of PyPI through `/+f/<sha>/<filename>`
URLs. With `--extra-index-url` + `--index-strategy=unsafe-best-match`
uv would pick those proxy URLs for ordinary PyPI packages —
anthropic/openai/propcache/annotated-types — and fail when the proxy
503s. Master is hitting the same bug on its own l4t-vllm matrix entry.
Switch the l4t13 install path to a pyproject.toml that marks the
jetson-ai-lab index `explicit = true` and pins only torch, torchvision,
torchaudio, flash-attn, and vllm to it via [tool.uv.sources]. uv won't
consult the L4T mirror for anything else, so transitive deps fall back
to PyPI as the default index — no exposure to the proxy 503s.
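The relevant stanzas look roughly like this (index URL path
abbreviated):

    [[tool.uv.index]]
    name = "jetson-ai-lab"
    url = "https://pypi.jetson-ai-lab.io/..."
    explicit = true

    [tool.uv.sources]
    torch = { index = "jetson-ai-lab" }
    torchvision = { index = "jetson-ai-lab" }
    torchaudio = { index = "jetson-ai-lab" }
    flash-attn = { index = "jetson-ai-lab" }
    vllm = { index = "jetson-ai-lab" }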
`uv pip install -r requirements.txt` ignores [tool.uv.sources], so the
l4t13 branch in install.sh now invokes `uv pip install --requirement
pyproject.toml` directly, replacing the old requirements-l4t13*.txt
files. Other BUILD_PROFILEs continue using libbackend.sh's
installRequirements and never read pyproject.toml.
Local resolution test (x86_64, dry-run) confirms uv hits the L4T
index for torch and falls through to PyPI for everything else.
Assisted-by: claude-code:claude-opus-4-7-1m [Read] [Edit] [Bash] [Write]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
---------
Signed-off-by: Richard Palethorpe <io@richiejp.com>
+++
disableToc = false
title = "Text Generation (GPT)"
weight = 10
url = "/features/text-generation/"
+++
LocalAI supports generating text with GPT using llama.cpp and other backends (such as rwkv.cpp); see also the [Model compatibility]({{%relref "reference/compatibility-table" %}}) for an up-to-date list of the supported model families.
Note:
- You can also specify the model name as part of the OpenAI token (example below).
- If only one model is available, the API will use it for all the requests.
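For instance, a sketch of passing the model via the Bearer token (the token is treated as the model name when the request body omits one):
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer ggml-koala-7b-model-q4_0-r2.bin" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say this is a test!"}]}'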
API Reference
Chat completions
https://platform.openai.com/docs/api-reference/chat
For example, to generate a chat completion, you can send a POST request to the /v1/chat/completions endpoint with the instruction as the request body:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.7
}'
Available additional parameters: top_p, top_k, max_tokens
Edit completions
https://platform.openai.com/docs/api-reference/edits
To generate an edit completion you can send a POST request to the /v1/edits endpoint with the instruction as the request body:
curl http://localhost:8080/v1/edits -H "Content-Type: application/json" -d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"instruction": "rephrase",
"input": "Black cat jumped out of the window",
"temperature": 0.7
}'
Available additional parameters: top_p, top_k, max_tokens.
Completions
https://platform.openai.com/docs/api-reference/completions
To generate a completion, you can send a POST request to the /v1/completions endpoint with the instruction as per the request body:
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"prompt": "A long time ago in a galaxy far, far away",
"temperature": 0.7
}'
Available additional parameters: top_p, top_k, max_tokens
List models
You can list all the models available with:
curl http://localhost:8080/v1/models
Anthropic Messages API
LocalAI supports the Anthropic Messages API, which is compatible with Claude clients. This endpoint provides a structured way to send messages and receive responses, with support for tools, streaming, and multimodal content.
Endpoint: POST /v1/messages or POST /messages
Reference: https://docs.anthropic.com/claude/reference/messages_post
Basic Usage
curl http://localhost:8080/v1/messages \
-H "Content-Type: application/json" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"max_tokens": 1024,
"messages": [
{"role": "user", "content": "Say this is a test!"}
]
}'
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | The model identifier |
| `messages` | array | Yes | Array of message objects with `role` and `content` |
| `max_tokens` | integer | Yes | Maximum number of tokens to generate (must be > 0) |
| `system` | string | No | System message to set the assistant's behavior |
| `temperature` | float | No | Sampling temperature (0.0 to 1.0) |
| `top_p` | float | No | Nucleus sampling parameter |
| `top_k` | integer | No | Top-k sampling parameter |
| `stop_sequences` | array | No | Array of strings that will stop generation |
| `stream` | boolean | No | Enable streaming responses |
| `tools` | array | No | Array of tool definitions for function calling |
| `tool_choice` | string/object | No | Tool choice strategy: "auto", "any", "none", or a specific tool |
| `metadata` | object | No | Per-request metadata passed to the backend (e.g., `{"enable_thinking": "true"}`) |
Message Format
Messages can contain text or structured content blocks:
curl http://localhost:8080/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"max_tokens": 1024,
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": "base64_encoded_image_data"
}
}
]
}
]
}'
Tool Calling
The Anthropic API supports function calling through tools:
curl http://localhost:8080/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"max_tokens": 1024,
"tools": [
{
"name": "get_weather",
"description": "Get the current weather",
"input_schema": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state"
}
},
"required": ["location"]
}
}
],
"tool_choice": "auto",
"messages": [
{"role": "user", "content": "What is the weather in San Francisco?"}
]
}'
Streaming
Enable streaming responses by setting stream: true:
curl http://localhost:8080/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"max_tokens": 1024,
"stream": true,
"messages": [
{"role": "user", "content": "Tell me a story"}
]
}'
Streaming responses use Server-Sent Events (SSE) format with event types: message_start, content_block_start, content_block_delta, content_block_stop, message_delta, and message_stop.
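An abbreviated stream might look like this (payloads shortened; the event shapes follow the Anthropic SSE specification):
event: message_start
data: {"type":"message_start","message":{"id":"msg_abc123","role":"assistant","content":[]}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Once"}}

event: message_stop
data: {"type":"message_stop"}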
Response Format
{
"id": "msg_abc123",
"type": "message",
"role": "assistant",
"content": [
{
"type": "text",
"text": "This is a test!"
}
],
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"stop_reason": "end_turn",
"usage": {
"input_tokens": 10,
"output_tokens": 5
}
}
Open Responses API
LocalAI supports the Open Responses API specification, which provides a standardized interface for AI model interactions with support for background processing, streaming, tool calling, and advanced features like reasoning.
Endpoint: POST /v1/responses or POST /responses
Reference: https://www.openresponses.org/specification
Basic Usage
curl http://localhost:8080/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"input": "Say this is a test!",
"max_output_tokens": 1024
}'
Request Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | The model identifier |
| `input` | string/array | Yes | Input text or array of input items |
| `max_output_tokens` | integer | No | Maximum number of tokens to generate |
| `temperature` | float | No | Sampling temperature |
| `top_p` | float | No | Nucleus sampling parameter |
| `instructions` | string | No | System instructions |
| `tools` | array | No | Array of tool definitions |
| `tool_choice` | string/object | No | Tool choice: "auto", "required", "none", or a specific tool |
| `stream` | boolean | No | Enable streaming responses |
| `background` | boolean | No | Run request in background (returns immediately) |
| `store` | boolean | No | Whether to store the response |
| `reasoning` | object | No | Reasoning configuration with `effort` and `summary` |
| `parallel_tool_calls` | boolean | No | Allow parallel tool calls |
| `max_tool_calls` | integer | No | Maximum number of tool calls |
| `presence_penalty` | float | No | Presence penalty (-2.0 to 2.0) |
| `frequency_penalty` | float | No | Frequency penalty (-2.0 to 2.0) |
| `top_logprobs` | integer | No | Number of top logprobs to return |
| `truncation` | string | No | Truncation mode: "auto" or "disabled" |
| `text_format` | object | No | Text format configuration |
| `metadata` | object | No | Custom metadata |
Input Format
Input can be a simple string or an array of structured items:
curl http://localhost:8080/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"input": [
{
"type": "message",
"role": "user",
"content": "What is the weather?"
}
],
"max_output_tokens": 1024
}'
Background Processing
Run requests in the background for long-running tasks:
curl http://localhost:8080/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"input": "Generate a long story",
"max_output_tokens": 4096,
"background": true
}'
The response will include a response ID that can be used to poll for completion:
{
"id": "resp_abc123",
"object": "response",
"status": "in_progress",
"created_at": 1234567890
}
Retrieving Background Responses
Use the GET endpoint to retrieve background responses:
# Get response by ID
curl http://localhost:8080/v1/responses/resp_abc123
# Resume streaming with query parameters
curl "http://localhost:8080/v1/responses/resp_abc123?stream=true&starting_after=10"
Canceling Background Responses
Cancel a background response that's still in progress:
curl -X POST http://localhost:8080/v1/responses/resp_abc123/cancel
Tool Calling
Open Responses API supports function calling with tools:
curl http://localhost:8080/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"input": "What is the weather in San Francisco?",
"tools": [
{
"type": "function",
"name": "get_weather",
"description": "Get the current weather",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state"
}
},
"required": ["location"]
}
}
],
"tool_choice": "auto",
"max_output_tokens": 1024
}'
Reasoning Configuration
Configure reasoning effort and summary style:
curl http://localhost:8080/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"input": "Solve this complex problem step by step",
"reasoning": {
"effort": "high",
"summary": "detailed"
},
"max_output_tokens": 2048
}'
Response Format
{
"id": "resp_abc123",
"object": "response",
"created_at": 1234567890,
"completed_at": 1234567895,
"status": "completed",
"model": "ggml-koala-7b-model-q4_0-r2.bin",
"output": [
{
"type": "message",
"id": "msg_001",
"role": "assistant",
"content": [
{
"type": "output_text",
"text": "This is a test!",
"annotations": [],
"logprobs": []
}
],
"status": "completed"
}
],
"error": null,
"incomplete_details": null,
"temperature": 0.7,
"top_p": 1.0,
"presence_penalty": 0.0,
"frequency_penalty": 0.0,
"usage": {
"input_tokens": 10,
"output_tokens": 5,
"total_tokens": 15,
"input_tokens_details": {
"cached_tokens": 0
},
"output_tokens_details": {
"reasoning_tokens": 0
}
}
}
Backends
RWKV
RWKV support is available through llama.cpp (see below)
llama.cpp
llama.cpp is a popular port of Facebook's LLaMA model in C/C++.
{{% notice note %}}
The ggml file format has been deprecated. If you are using ggml models and configuring your model with a YAML file, use a LocalAI version older than v2.25.0. For gguf models, use the llama backend. The go backend is deprecated as well but still available as go-llama.
{{% /notice %}}
Features
The llama.cpp model supports the following features:
- [📖 Text generation (GPT)]({{%relref "features/text-generation" %}})
- [🧠 Embeddings]({{%relref "features/embeddings" %}})
- [🔥 OpenAI functions]({{%relref "features/openai-functions" %}})
- [✍️ Constrained grammars]({{%relref "features/constrained_grammars" %}})
Setup
LocalAI supports llama.cpp models out of the box. You can use the llama.cpp model in the same way as any other model.
Manual setup
It is sufficient to copy the ggml or gguf model files into the models folder. You can refer to the model with the model parameter in the API calls.
[You can optionally create an associated YAML]({{%relref "advanced" %}}) model config file to tune the model's parameters or apply a template to the prompt.
Prompt templates are useful for models that are fine-tuned towards a specific prompt.
Automatic setup
LocalAI supports model galleries which are indexes of models. For instance, the huggingface gallery contains a large curated index of models from the huggingface model hub for ggml or gguf models.
For instance, if you have the galleries enabled and LocalAI already running, you can just start chatting with models in huggingface by running:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "TheBloke/WizardLM-13B-V1.2-GGML/wizardlm-13b-v1.2.ggmlv3.q2_K.bin",
"messages": [{"role": "user", "content": "Say this is a test!"}],
"temperature": 0.1
}'
LocalAI will automatically download and configure the model in the model directory.
Models can also be preloaded or downloaded on demand. To learn about model galleries, check out the [model gallery documentation]({{%relref "features/model-gallery" %}}).
YAML configuration
To use the llama.cpp backend, specify llama-cpp as the backend in the YAML file:
name: llama
backend: llama-cpp
parameters:
# Relative to the models path
model: file.gguf
Backend Options
The llama.cpp backend supports additional configuration options that can be specified in the options field of your model YAML configuration. These options allow fine-tuning of the backend behavior:
| Option | Type | Description | Example |
|---|---|---|---|
| `use_jinja` or `jinja` | boolean | Enable Jinja2 template processing for chat templates. When enabled, the backend uses Jinja2-based chat templates from the model for formatting messages. | `use_jinja:true` |
| `context_shift` | boolean | Enable context shifting, which allows the model to dynamically adjust context window usage. | `context_shift:true` |
| `cache_ram` | integer | Set the maximum RAM cache size in MiB for the KV cache. Use `-1` for unlimited (default). | `cache_ram:2048` |
| `parallel` or `n_parallel` | integer | Enable parallel request processing. When set to a value greater than 1, enables continuous batching for handling multiple requests concurrently. | `parallel:4` |
| `grpc_servers` or `rpc_servers` | string | Comma-separated list of gRPC server addresses for distributed inference. Allows distributing workload across multiple llama.cpp workers. | `grpc_servers:localhost:50051,localhost:50052` |
| `fit_params` or `fit` | boolean | Enable auto-adjustment of model/context parameters to fit available device memory. Default: `true`. | `fit_params:true` |
| `fit_params_target` or `fit_target` | integer | Target margin per device in MiB when using `fit_params`. Default: `1024` (1 GiB). | `fit_target:2048` |
| `fit_params_min_ctx` or `fit_ctx` | integer | Minimum context size that can be set by `fit_params`. Default: `4096`. | `fit_ctx:2048` |
| `n_cache_reuse` or `cache_reuse` | integer | Minimum chunk size to attempt reusing from the cache via KV shifting. Default: `0` (disabled). | `cache_reuse:256` |
| `slot_prompt_similarity` or `sps` | float | How much the prompt of a request must match the prompt of a slot to use that slot. Default: `0.1`. Set to `0` to disable. | `sps:0.5` |
| `swa_full` | boolean | Use full-size SWA (Sliding Window Attention) cache. Default: `false`. | `swa_full:true` |
| `cont_batching` or `continuous_batching` | boolean | Enable continuous batching for handling multiple sequences. Default: `true`. | `cont_batching:true` |
| `check_tensors` | boolean | Validate tensor data for invalid values during model loading. Default: `false`. | `check_tensors:true` |
| `warmup` | boolean | Enable a warmup run after model loading. Default: `true`. | `warmup:false` |
| `no_op_offload` | boolean | Disable offloading host tensor operations to the device. Default: `false`. | `no_op_offload:true` |
| `kv_unified` or `unified_kv` | boolean | Enable unified KV cache. Default: `false`. | `kv_unified:true` |
| `n_ctx_checkpoints` or `ctx_checkpoints` | integer | Maximum number of context checkpoints per slot. Default: `8`. | `ctx_checkpoints:4` |
| `split_mode` or `sm` | string | How to split the model across multiple GPUs: `none` (single GPU only), `layer` (default; split layers and KV across GPUs), `row` (split rows across GPUs), `tensor` (experimental tensor parallelism; requires `flash_attention: true`, no KV-cache quantization, a manually set `context_size`, and a llama.cpp build that includes #19378). | `split_mode:tensor` |
Example configuration with options:
name: llama-model
backend: llama-cpp
parameters:
model: model.gguf
options:
- use_jinja:true
- context_shift:true
- cache_ram:4096
- parallel:2
- fit_params:true
- fit_target:1024
- slot_prompt_similarity:0.5
Note: The parallel option can also be set via the LLAMACPP_PARALLEL environment variable, and grpc_servers can be set via the LLAMACPP_GRPC_SERVERS environment variable. Options specified in the YAML file take precedence over environment variables.
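For example (variable names as given in the note above):
LLAMACPP_PARALLEL=4 \
LLAMACPP_GRPC_SERVERS="localhost:50051,localhost:50052" \
local-ai run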
Reference
ik_llama.cpp
ik_llama.cpp is a hard fork of llama.cpp by Iwan Kawrakow that focuses on superior CPU and hybrid GPU/CPU performance. It ships additional quantization types (IQK quants), custom quantization mixes, Multi-head Latent Attention (MLA) for DeepSeek models, and fine-grained tensor offload controls — particularly useful for running very large models on commodity CPU hardware.
{{% notice note %}}
The ik-llama-cpp backend requires a CPU with AVX2 support. The IQK kernels are not compatible with older CPUs.
{{% /notice %}}
Features
The ik-llama-cpp backend supports the following features:
- [📖 Text generation (GPT)]({{%relref "features/text-generation" %}})
- [🧠 Embeddings]({{%relref "features/embeddings" %}})
- IQK quantization types for better CPU inference performance
- Multimodal models (via clip/llava)
Setup
The backend is distributed as a separate container image and can be installed from the LocalAI backend gallery, or specified directly in a model configuration. GGUF models loaded with this backend benefit from ik_llama.cpp's optimized CPU kernels — especially useful for MoE models and large quantized models that would otherwise be GPU-bound.
YAML configuration
To use the ik-llama-cpp backend, specify it as the backend in the YAML file:
name: my-model
backend: ik-llama-cpp
parameters:
# Relative to the models path
model: file.gguf
The aliases ik-llama and ik_llama are also accepted.
Reference
turboquant (llama.cpp fork with TurboQuant KV-cache)
llama-cpp-turboquant is a llama.cpp fork that adds the TurboQuant KV-cache quantization scheme. It reuses the upstream llama.cpp codebase and ships as a drop-in alternative backend inside LocalAI, sharing the same gRPC server sources as the stock llama-cpp backend — so any GGUF model that runs on llama-cpp also runs on turboquant.
You would pick turboquant when you want smaller KV-cache memory pressure (longer contexts on the same VRAM) or to experiment with the fork's quantized KV representations on top of the standard cache_type_k / cache_type_v knobs already supported by upstream llama.cpp.
Features
- Drop-in GGUF compatibility with upstream llama.cpp.
- TurboQuant KV-cache quantization (see the fork README for the current set of accepted cache_type_k / cache_type_v values).
- Same feature surface as the llama-cpp backend: text generation, embeddings, tool calls, multimodal via mmproj.
- Available on CPU (AVX/AVX2/AVX512/fallback), NVIDIA CUDA 12/13, AMD ROCm/HIP, Intel SYCL f32/f16, Vulkan, and NVIDIA L4T.
Setup
turboquant ships as a separate container image in the LocalAI backend gallery. Install it like any other backend:
local-ai backends install turboquant
Or pick a specific flavor for your hardware (example tags: cpu-turboquant, cuda12-turboquant, cuda13-turboquant, rocm-turboquant, intel-sycl-f16-turboquant, vulkan-turboquant).
YAML configuration
To run a model with turboquant, set the backend in your model YAML and optionally pick quantized KV-cache types:
name: my-model
backend: turboquant
parameters:
# Relative to the models path
model: file.gguf
# Use TurboQuant's own KV-cache quantization schemes. The fork accepts
# the standard llama.cpp types (f16, f32, q8_0, q4_0, q4_1, q5_0, q5_1)
# and adds three TurboQuant-specific ones: turbo2, turbo3, turbo4.
# turbo3 / turbo4 auto-enable flash_attention (required for turbo K/V)
# and offer progressively more aggressive compression.
cache_type_k: turbo3
cache_type_v: turbo3
context_size: 8192
The cache_type_k / cache_type_v fields map to llama.cpp's -ctk / -ctv flags. The stock llama-cpp backend only accepts the standard llama.cpp types — to use turbo2 / turbo3 / turbo4 you need this turboquant backend, which is where the fork's TurboQuant code paths actually take effect. Pick q8_0 here and you're just running stock llama.cpp KV quantization; pick turbo* and you're running TurboQuant.
Reference
vLLM
vLLM is a fast and easy-to-use library for LLM inference.
LocalAI has a built-in integration with vLLM, and it can be used to run models. You can check out vllm performance here.
Setup
Create a YAML file for the model you want to use with vllm.
To set up a model, you just need to specify the model name in the YAML config file:
name: vllm
backend: vllm
parameters:
model: "facebook/opt-125m"
The backend will automatically download the required files in order to run the model.
Usage
Use the completions endpoint by specifying the vllm backend:
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "vllm",
"prompt": "Hello, my name is",
"temperature": 0.1, "top_p": 0.1
}'
Passing arbitrary vLLM options with engine_args
A subset of AsyncEngineArgs is exposed as typed YAML fields
(tensor_parallel_size, gpu_memory_utilization, quantization,
max_model_len, dtype, trust_remote_code, enforce_eager, …).
Anything else can be passed through the generic engine_args: map.
Keys are forwarded verbatim to vLLM's engine; unknown keys fail at load
time with the closest valid name as a hint. Nested maps materialise
into vLLM's nested config dataclasses (SpeculativeConfig,
KVTransferConfig, CompilationConfig, …).
Speculative decoding (DFlash, ngram, eagle, deepseek_mtp, …) is configured this way:
name: qwen3.5-4b-dflash
backend: vllm
parameters:
model: Qwen/Qwen3.5-4B
context_size: 8192
max_model_len: 8192
trust_remote_code: true
quantization: fp8
template:
use_tokenizer_template: true
engine_args:
speculative_config:
method: dflash
model: z-lab/Qwen3.5-4B-DFlash
num_speculative_tokens: 15
The shape of speculative_config follows vLLM's
SpeculativeConfig
— method picks the algorithm, the remaining keys are method-specific.
Drafters from z-lab are paired with
specific target models; pick the one that matches your target. The
drafter loads in its native precision regardless of the target's
quantization: setting.
Another example — picking a non-default attention backend (e.g. on hardware where the default cutlass kernels aren't supported):
engine_args:
attention_backend: TRITON_ATTN
Multi-node data parallelism
engine_args.data_parallel_size > 1 combined with the
local-ai p2p-worker vllm follower lets a single model span multiple
GPU nodes. See [vLLM Multi-Node (Data-Parallel)]({{% relref
"/docs/features/distributed-mode#vllm-multi-node-data-parallel" %}})
for the head/follower configuration and a worked Kimi-K2.6 example.
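A minimal head-node sketch (model and rank count illustrative):
name: my-dp-model
backend: vllm
parameters:
  model: Qwen/Qwen2.5-0.5B-Instruct
engine_args:
  data_parallel_size: 2
Each follower node then runs the worker, e.g. local-ai p2p-worker vllm --headless --start-rank 1 (see the distributed-mode page for the full flag set).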
Transformers
Transformers is a State-of-the-art Machine Learning library for PyTorch, TensorFlow, and JAX.
LocalAI has a built-in integration with Transformers, and it can be used to run models.
This is an extra backend: it is already available in the extra container images (which include the Python dependencies for Transformers), so there is nothing to do for the setup.
Setup
Create a YAML file for the model you want to use with transformers.
To set up a model, you just need to specify the model name in the YAML config file:
name: transformers
backend: transformers
parameters:
model: "facebook/opt-125m"
type: AutoModelForCausalLM
quantization: bnb_4bit # One of: bnb_8bit, bnb_4bit, xpu_4bit, xpu_8bit (optional)
The backend will automatically download the required files in order to run the model.
Parameters
Type
| Type | Description |
|---|---|
| `AutoModelForCausalLM` | A model that can be used to generate sequences. Use it for NVIDIA CUDA and Intel GPU with Intel Extension for PyTorch acceleration |
| `OVModelForCausalLM` | For Intel CPU/GPU/NPU OpenVINO Text Generation models |
| `OVModelForFeatureExtraction` | For Intel CPU/GPU/NPU OpenVINO embedding acceleration |
| N/A | Defaults to `AutoModel` |
- OVModelForCausalLM requires OpenVINO IR Text Generation models from Hugging Face.
- OVModelForFeatureExtraction works with any Safetensors Transformer Feature Extraction model from Hugging Face (i.e. embedding models).
Please note that streaming is currently not implemented in AutoModelForCausalLM for Intel GPU.
AMD GPU support is not implemented.
Although AMD CPUs are not officially supported by OpenVINO, there are reports that it works: YMMV.
Embeddings
Use embeddings: true if the model is an embedding model, as in the sketch below.
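A minimal embedding-model config (model name illustrative):
name: my-embedder
backend: transformers
embeddings: true
parameters:
  model: sentence-transformers/all-MiniLM-L6-v2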
Inference device selection
The Transformers backend tries to automatically select the best device for inference; you can override this decision manually with the main_gpu parameter.
| Inference Engine | Applicable Values |
|---|---|
| CUDA | cuda, cuda.X where X is the GPU device like in nvidia-smi -L output |
| OpenVINO | Any applicable value from Inference Modes like AUTO, CPU, GPU, NPU, MULTI, HETERO |
Example for CUDA:
main_gpu: cuda.0
Example for OpenVINO:
main_gpu: AUTO:-CPU
This parameter applies to both Text Generation and Feature Extraction (i.e. Embeddings) models.
Inference Precision
The Transformers backend automatically selects the fastest applicable inference precision according to device support. On CUDA, you can manually enable bfloat16, if your hardware supports it, with the following parameter:
f16: true
Quantization
| Quantization | Description |
|---|---|
| `bnb_8bit` | 8-bit quantization |
| `bnb_4bit` | 4-bit quantization |
| `xpu_8bit` | 8-bit quantization for Intel XPUs |
| `xpu_4bit` | 4-bit quantization for Intel XPUs |
Trust Remote Code
Some models, like Microsoft Phi-3, require external code beyond what the Transformers library provides.
By default this is disabled for security reasons.
It can be manually enabled with:
trust_remote_code: true
Maximum Context Size
The maximum context size (in tokens) can be specified with the context_size parameter. Do not use values higher than what your model supports.
Usage example:
context_size: 8192
Auto Prompt Template
Usually the chat template is defined by the model author in the tokenizer_config.json file.
To enable it, use the use_tokenizer_template: true parameter in the template section.
Usage example:
template:
use_tokenizer_template: true
Custom Stop Words
Stop words are usually defined in the tokenizer_config.json file.
They can be overridden with the stopwords parameter when needed, as for llama3-Instruct models.
Usage example:
stopwords:
- "<|eot_id|>"
- "<|end_of_text|>"
Usage
Use the completions endpoint by specifying the transformers model:
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
"model": "transformers",
"prompt": "Hello, my name is",
"temperature": 0.1, "top_p": 0.1
}'
Examples
OpenVINO
A model configuration file for OpenVINO and the Starling model:
name: starling-openvino
backend: transformers
parameters:
model: fakezeta/Starling-LM-7B-beta-openvino-int8
context_size: 8192
threads: 6
f16: true
type: OVModelForCausalLM
stopwords:
- <|end_of_turn|>
- <|endoftext|>
prompt_cache_path: "cache"
prompt_cache_all: true
template:
chat_message: |
{{if eq .RoleName "system"}}{{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "assistant"}}<|end_of_turn|>GPT4 Correct Assistant: {{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "user"}}GPT4 Correct User: {{.Content}}{{end}}
chat: |
{{.Input}}<|end_of_turn|>GPT4 Correct Assistant:
completion: |
{{.Input}}