fix(llama-cpp): track upstream rename checkpoint_every_nt -> checkpoint_min_step

Agent-Logs-Url: https://github.com/mudler/LocalAI/sessions/f8d7e7db-85d0-42e1-89dc-08222d3aace3
Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>
2026-05-29 19:19:19 -04:00 · 2026-05-25 21:46:54 +00:00 · 2026-05-25 21:34:09 +00:00
2 changed files with 15 additions and 9 deletions
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -570,9 +570,11 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
    // kv_unified=false or cache_ram_mib=0, so flipping kv_unified above is
    // what actually unlocks it.
    params.cache_idle_slots = true;
-    // checkpoint_every_nt: create a context checkpoint every N tokens during
-    // prefill (-1 disables). Match upstream's default (8192).
-    params.checkpoint_every_nt = 8192;
+    // checkpoint_min_step: minimum spacing between context checkpoints in
+    // tokens (0 disables the minimum). Match upstream's default (256). This
+    // field was renamed from `checkpoint_every_nt` in llama.cpp; the semantics
+    // also shifted from a fixed cadence to a minimum spacing.
+    params.checkpoint_min_step = 256;

     // decode options. Options are in form optname:optvale, or if booleans only optname.
    for (int i = 0; i < request->options_size(); i++) {
@@ -746,14 +748,18 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
                params.cache_idle_slots = false;
            }

-        // --- prefill checkpoint cadence (upstream -cpent / --checkpoint-every-n-tokens) ---
-        // -1 disables checkpointing during prefill.
-        } else if (!strcmp(optname, "checkpoint_every_nt") || !strcmp(optname, "checkpoint_every_n_tokens")) {
+        // --- minimum context-checkpoint spacing (upstream -cms / --checkpoint-min-step) ---
+        // 0 disables the minimum-spacing gate. Old option names (`checkpoint_every_nt`,
+        // `checkpoint_every_n_tokens`) are kept as aliases for backward compatibility
+        // with existing user configs: upstream renamed the field and shifted its
+        // semantics from a fixed cadence to a minimum spacing.
+        } else if (!strcmp(optname, "checkpoint_min_step") || !strcmp(optname, "checkpoint_min_spacing") ||
+                   !strcmp(optname, "checkpoint_every_nt") || !strcmp(optname, "checkpoint_every_n_tokens")) {
            if (optval != NULL) {
                try {
-                    params.checkpoint_every_nt = std::stoi(optval_str);
+                    params.checkpoint_min_step = std::stoi(optval_str);
                } catch (const std::exception& e) {
-                    // If conversion fails, keep default value (8192)
+                    // If conversion fails, keep default value (256)
                }
            }

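Reviewer note: the alias handling in the hunk above can be exercised in isolation. The following is a minimal sketch, not the backend's code: `demo_params`, `is_checkpoint_min_step_option`, `apply_option`, and `demo_usage` are hypothetical names introduced here for illustration, and only the four option spellings, the `std::stoi` conversion, and the default of 256 are taken from the diff.

```cpp
#include <cassert>
#include <cstring>
#include <exception>
#include <string>

// Hypothetical stand-in for the real params struct; only one field is modeled.
struct demo_params {
    int checkpoint_min_step = 256;  // default value taken from the diff
};

// Returns true when optname is one of the four spellings accepted in the diff.
static bool is_checkpoint_min_step_option(const char* optname) {
    static const char* const names[] = {
        "checkpoint_min_step", "checkpoint_min_spacing",    // new names
        "checkpoint_every_nt", "checkpoint_every_n_tokens", // legacy aliases
    };
    for (const char* name : names) {
        if (std::strcmp(optname, name) == 0) {
            return true;
        }
    }
    return false;
}

// Parses one "optname:optval" string. Unknown names and unparsable values
// leave params unchanged, mirroring the empty catch block in the diff.
static void apply_option(demo_params& params, const std::string& option) {
    const std::size_t colon = option.find(':');
    if (colon == std::string::npos) {
        return;  // no value supplied
    }
    const std::string name = option.substr(0, colon);
    const std::string value = option.substr(colon + 1);
    if (!is_checkpoint_min_step_option(name.c_str())) {
        return;
    }
    try {
        params.checkpoint_min_step = std::stoi(value);
    } catch (const std::exception&) {
        // Conversion failed: keep the current value.
    }
}

// Small self-check showing how new names, legacy aliases, and bad input behave.
static void demo_usage() {
    demo_params params;
    apply_option(params, "checkpoint_min_step:1024");
    assert(params.checkpoint_min_step == 1024);
    apply_option(params, "checkpoint_every_nt:4096");  // legacy alias
    assert(params.checkpoint_min_step == 4096);
    apply_option(params, "checkpoint_min_spacing:abc");  // not a number
    assert(params.checkpoint_min_step == 4096);
}
```

One thing the sketch makes visible: a legacy config such as `checkpoint_every_nt:-1` is stored as `-1` without translation, because the alias only maps the name. The old comment treated `-1` as "disabled", while the new comment documents `0` as the value that disables the gate.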
--- a/docs/content/features/text-generation.md
+++ b/docs/content/features/text-generation.md
@@ -515,7 +515,7 @@ The `llama.cpp` backend supports additional configuration options that can be sp
 | `kv_unified` or `unified_kv` | boolean | Use a single unified KV buffer shared across all sequences. Default: `true` (LocalAI override; upstream defaults to `false` but auto-enables it when slot count is auto). **Required for `cache_idle_slots` to work**: without it the server force-disables idle-slot saving at init, and the prompt cache is never written across requests. | `kv_unified:false` |
 | `cache_idle_slots` or `idle_slots_cache` | boolean | On a new task, save the previous slot's KV state into the prompt cache (and clear the slot) so a later request with the same prefix can warm-load it. Default: `true`. Auto-disabled by the server if `kv_unified=false` or `cache_ram=0`. | `cache_idle_slots:false` |
 | `n_ctx_checkpoints` or `ctx_checkpoints` | integer | Maximum number of context checkpoints per slot (used for partial-prefix recovery, e.g. SWA). Default: `32`. | `ctx_checkpoints:16` |
-| `checkpoint_every_nt` or `checkpoint_every_n_tokens` | integer | Create a context checkpoint every N tokens during prefill. `-1` disables checkpointing. Default: `8192`. | `checkpoint_every_nt:4096` |
+| `checkpoint_min_step` or `checkpoint_min_spacing` (aliases: `checkpoint_every_nt`, `checkpoint_every_n_tokens`) | integer | Minimum spacing in tokens between context checkpoints. `0` disables the minimum-spacing gate. Default: `256`. (Renamed upstream from `checkpoint_every_nt`; semantics shifted from a fixed cadence to a minimum spacing.) | `checkpoint_min_step:1024` |
 | `split_mode` or `sm` | string | How to split the model across multiple GPUs: `none` (single GPU only), `layer` (default — split layers and KV across GPUs), `row` (split rows across GPUs), `tensor` (experimental tensor parallelism — requires `flash_attention: true`, no KV-cache quantization, manually set `context_size`, and a llama.cpp build that includes [#19378](https://github.com/ggml-org/llama.cpp/pull/19378)). | `split_mode:tensor` |

 **Example configuration with options:**
Author	SHA1	Message	Date
copilot-swe-agent[bot]	20854bc000	fix(llama-cpp): track upstream rename checkpoint_every_nt -> checkpoint_min_step Agent-Logs-Url: https://github.com/mudler/LocalAI/sessions/f8d7e7db-85d0-42e1-89dc-08222d3aace3 Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-05-25 21:46:54 +00:00
copilot-swe-agent[bot]	806de27ae7	Initial plan	2026-05-25 21:34:09 +00:00
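
Reviewer note: the commit describes the semantic change as moving "from a fixed cadence to a minimum spacing". The sketch below illustrates one plausible reading of that difference. It is an assumption made for illustration and does not reproduce llama.cpp's checkpointing logic: `fixed_cadence` and `min_spacing` are hypothetical helpers, and the rule that the first candidate is always accepted is a choice made here, not something the diff states.

```cpp
#include <vector>

// Fixed cadence (old reading): a checkpoint at every multiple of every_n
// up to n_tokens. A non-positive every_n yields no checkpoints, matching
// the old "-1 disables" comment.
std::vector<int> fixed_cadence(int n_tokens, int every_n) {
    std::vector<int> positions;
    if (every_n <= 0) {
        return positions;
    }
    for (int pos = every_n; pos <= n_tokens; pos += every_n) {
        positions.push_back(pos);
    }
    return positions;
}

// Minimum spacing (new reading): candidate positions arrive in ascending
// order from elsewhere; a candidate is kept only if it lies at least
// min_step tokens after the last kept one. min_step <= 0 turns the gate
// off, so every candidate is kept ("0 disables the minimum").
std::vector<int> min_spacing(const std::vector<int>& candidates, int min_step) {
    std::vector<int> kept;
    for (int pos : candidates) {
        // Assumption: the first candidate is always accepted.
        if (kept.empty() || min_step <= 0 || pos - kept.back() >= min_step) {
            kept.push_back(pos);
        }
    }
    return kept;
}
```

Under these assumptions, `fixed_cadence(20000, 8192)` yields positions 8192 and 16384, while `min_spacing({100, 200, 400, 700, 1000}, 256)` keeps 100, 400, 700, and 1000 and drops 200 because it falls only 100 tokens after the previous checkpoint. The practical point for users of the legacy aliases is that the same number now acts as a lower bound on spacing rather than as a schedule.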