feat(llama-cpp): expose split_mode option for multi-GPU placement (#9560)
Adds split_mode (alias sm) to the llama.cpp backend options allowlist, accepting none|layer|row|tensor. The tensor value targets the experimental backend-agnostic tensor parallelism from ggml-org/llama.cpp#19378 and requires a llama.cpp build that includes that PR, FlashAttention enabled, KV-cache quantization disabled, and a manually set context size.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Commit: 21eace40ec
Parent: 24505e57f5
Committed by: GitHub
@@ -514,6 +514,7 @@ The `llama.cpp` backend supports additional configuration options that can be sp
| `no_op_offload` | boolean | Disable offloading host tensor operations to device. Default: `false`. | `no_op_offload:true` |
| `kv_unified` or `unified_kv` | boolean | Enable unified KV cache. Default: `false`. | `kv_unified:true` |
| `n_ctx_checkpoints` or `ctx_checkpoints` | integer | Maximum number of context checkpoints per slot. Default: `8`. | `ctx_checkpoints:4` |
| `split_mode` or `sm` | string | How to split the model across multiple GPUs: `none` (single GPU only), `layer` (default; split layers and KV across GPUs), `row` (split rows across GPUs), `tensor` (experimental tensor parallelism; requires `flash_attention: true`, no KV-cache quantization, a manually set `context_size`, and a llama.cpp build that includes [#19378](https://github.com/ggml-org/llama.cpp/pull/19378)). | `split_mode:tensor` |

**Example configuration with options:**
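A minimal sketch, assuming the usual LocalAI model YAML layout (`name`, `backend`, `parameters.model`, `context_size`, `flash_attention`, and an `options` list of `key:value` strings); the model name and file are placeholders, and exact field names may differ in your LocalAI version:

```yaml
# Illustrative model config using the llama.cpp backend options from the table above.
# split_mode:tensor additionally needs flash_attention enabled, an unquantized
# KV cache (the default), and a manually set context_size.
name: multi-gpu-model            # placeholder name
backend: llama-cpp
parameters:
  model: model.gguf              # placeholder model file
context_size: 8192               # set manually; required for split_mode:tensor
flash_attention: true            # required for split_mode:tensor
options:
  - split_mode:tensor            # or: none | layer (default) | row
  - kv_unified:true
  - ctx_checkpoints:4
```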