From 595e4487148bc637e45ef3b19289441ea352c156 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Tue, 2 Jun 2026 15:52:23 +0200
Subject: [PATCH] docs(llama.cpp): note tensor split now works with quantized KV cache (#10135)

The split_mode: tensor description claimed tensor parallelism requires
KV-cache quantization to be disabled. ggml-org/llama.cpp#23792 lifts that
restriction by extending the meta backend to preserve shape information
through KV-cache flatten/reshape, so cache_type_k/cache_type_v
quantization can be combined with -sm tensor on builds that include it.

Documentation only: no backend code, grpc-server.cpp comment, or
llama.cpp pin changes.

Assisted-by: Claude Code:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto
---
 docs/content/features/text-generation.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/content/features/text-generation.md b/docs/content/features/text-generation.md
index 709fbaf52..c09717a3f 100644
--- a/docs/content/features/text-generation.md
+++ b/docs/content/features/text-generation.md
@@ -516,7 +516,7 @@ The `llama.cpp` backend supports additional configuration options that can be sp
 | `cache_idle_slots` or `idle_slots_cache` | boolean | On a new task, save the previous slot's KV state into the prompt cache (and clear the slot) so a later request with the same prefix can warm-load it. Default: `true`. Auto-disabled by the server if `kv_unified=false` or `cache_ram=0`. | `cache_idle_slots:false` |
 | `n_ctx_checkpoints` or `ctx_checkpoints` | integer | Maximum number of context checkpoints per slot (used for partial-prefix recovery, e.g. SWA). Default: `32`. | `ctx_checkpoints:16` |
 | `checkpoint_min_step` or `checkpoint_min_spacing` (aliases: `checkpoint_every_nt`, `checkpoint_every_n_tokens`) | integer | Minimum spacing in tokens between context checkpoints. `0` disables the minimum-spacing gate. Default: `256`. (Renamed upstream from `checkpoint_every_nt`; semantics shifted from a fixed cadence to a minimum spacing.) | `checkpoint_min_step:1024` |
-| `split_mode` or `sm` | string | How to split the model across multiple GPUs: `none` (single GPU only), `layer` (default — split layers and KV across GPUs), `row` (split rows across GPUs), `tensor` (experimental tensor parallelism — requires `flash_attention: true`, no KV-cache quantization, manually set `context_size`, and a llama.cpp build that includes [#19378](https://github.com/ggml-org/llama.cpp/pull/19378)). | `split_mode:tensor` |
+| `split_mode` or `sm` | string | How to split the model across multiple GPUs: `none` (single GPU only), `layer` (default — split layers and KV across GPUs), `row` (split rows across GPUs), `tensor` (experimental tensor parallelism, requires `flash_attention: true`, manually set `context_size`, and a llama.cpp build that includes [#19378](https://github.com/ggml-org/llama.cpp/pull/19378); it historically also required KV-cache quantization to be disabled, but [#23792](https://github.com/ggml-org/llama.cpp/pull/23792) lifts that restriction so `cache_type_k`/`cache_type_v` quantization can be combined with tensor parallelism on builds that include it). | `split_mode:tensor` |
 
 **Example configuration with options:**
 
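
As an illustration of the updated `split_mode` row (not part of the patch itself), a model definition that combines tensor parallelism with a quantized KV cache might look like the sketch below. It assumes a llama.cpp build that includes both #19378 and #23792. The model name, model file, context size, and the `q8_0` cache type are placeholder values that do not come from the patch, and the key layout is inferred from the option names quoted in the table rather than checked against a schema.

```yaml
# Hypothetical model definition: name, model file, and sizes are placeholders.
name: my-model
backend: llama-cpp
parameters:
  model: my-model.gguf
context_size: 8192        # tensor split needs a manually set context size
flash_attention: true     # required by split_mode:tensor
cache_type_k: q8_0        # quantized K cache (assumed value)
cache_type_v: q8_0        # quantized V cache (assumed value)
options:
  - split_mode:tensor     # experimental tensor parallelism
```

On builds that predate #23792, the two `cache_type_*` lines would have to be removed for `split_mode:tensor` to work, per the previous wording of the row.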