From 977ccd88f0fa0ec26050a0e756e28b941997a79d Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Wed, 24 Jun 2026 17:13:13 +0000
Subject: [PATCH] docs(llama-cpp): document cpu_moe/n_cpu_moe and option passthrough

Signed-off-by: Ettore Di Giacinto
---
 docs/content/advanced/model-configuration.md | 32 ++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/docs/content/advanced/model-configuration.md b/docs/content/advanced/model-configuration.md
index 55e435b12..06b516ac0 100644
--- a/docs/content/advanced/model-configuration.md
+++ b/docs/content/advanced/model-configuration.md
@@ -494,6 +494,38 @@ These llama.cpp options are passed through the `options:` array.
 | `direct_io` / `use_direct_io` | bool | `false` | Open the model with `O_DIRECT` (faster cold loads on NVMe; ignored if not supported). |
 | `verbosity` | int | `3` | llama.cpp internal log verbosity threshold. Higher = more verbose. |
 | `override_tensor` / `tensor_buft_overrides` | string | "" | Per-tensor buffer-type overrides for the main model. Format: `=,=,...`. Mirrors the existing `draft_override_tensor` syntax for the draft model. |
+| `cpu_moe` | bool | `false` | Keep all MoE expert weights of the main model on CPU (upstream `--cpu-moe`). Frees VRAM on large MoE models (DeepSeek, Qwen3 `*-A3B`). |
+| `n_cpu_moe` | int | `0` | Keep MoE expert weights of the first N main-model layers on CPU (upstream `--n-cpu-moe`). |
+
+#### Generic option passthrough
+
+Any `options:` entry whose name starts with `-` is forwarded **verbatim** to
+upstream llama.cpp's own `llama-server` argument parser. This means any flag the
+bundled llama.cpp supports works without LocalAI needing a dedicated option,
+even ones added after your LocalAI version was built. See the upstream
+[server flags reference](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md).
+
+Format mirrors the rest of the array - `--flag` for a boolean, or `--flag:value`
+for a flag that takes a value. Everything after the first `:` is the value, so
+embedded colons (e.g. `host:port`) are preserved:
+
+```yaml
+options:
+  - "--cpu-moe"          # boolean flag
+  - "--n-cpu-moe:4"      # flag with a value
+  - "--override-tensor:exps=CPU"
+```
+
+Notes:
+
+- **Precedence:** passthrough flags are applied last, so an explicit flag
+  overrides the LocalAI option it maps to (e.g. `--ctx-size:8192` overrides
+  `context_size`).
+- **Power-user territory:** an invalid flag or value is rejected by the upstream
+  parser exactly as it would be by `llama-server`, which can fail model loading.
+  Prefer the named options above when one exists.
+- `--help`, `--usage`, and `--completion*` are ignored (they would terminate the
+  backend process).
 
 ### Prompt Caching