From bc4cd3dd85ee74c059da56a0120d2aad07465158 Mon Sep 17 00:00:00 2001
From: "LocalAI [bot]" <139863280+localai-bot@users.noreply.github.com>
Date: Tue, 12 May 2026 17:22:37 +0200
Subject: [PATCH] feat(llama-cpp): bump to `1ec7ba0c`, adapt grpc-server,
 expose new spec-decoding options (#9765)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* chore(llama.cpp): bump to 1ec7ba0c14f33f17e980daeeda5f35b225d41994

Picks up the upstream `spec : parallel drafting support` change
(ggml-org/llama.cpp#22838), which reshapes the speculative-decoding API and
`server_context_impl`. Adapt the grpc-server wrapper accordingly:

* `common_params_speculative::type` (a single enum) became `types` (a
  `std::vector`). Update both the "default to draft when a draft model is
  set" branch and the `spec_type`/`speculative_type` option parser. The
  parser now also tolerates comma-separated lists, mirroring the upstream
  `common_speculative_types_from_names` semantics.
* `common_params_speculative_draft::n_ctx` is gone (the draft now shares the
  target context size). Keep the `draft_ctx_size` option name for backward
  compatibility and ignore the value rather than failing.
* `server_context_impl::model` was renamed to `model_tgt`; update the two
  reranker / model-metadata call sites.

Replaces #9763. Builds cleanly under the linux/amd64 cpu-llama-cpp target
locally.

Signed-off-by: Ettore Di Giacinto

* feat(llama-cpp): expose new speculative-decoding option keys

Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838) adds
the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative families and
beefs up the draft-model knobs. The previous bump only adapted the API; this
commit exposes the new fields through the grpc-server options dictionary so
that model configs can drive them.

New `options:` keys (all under `backend: llama-cpp`):

ngram_mod (`ngram_mod` type):
    spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match

ngram_map_k (`ngram_map_k` type):
    spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits

ngram_map_k4v (`ngram_map_k4v` type):
    spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m / spec_ngram_map_k4v_min_hits

ngram lookup caches (`ngram_cache` type):
    spec_lookup_cache_static / lookup_cache_static
    spec_lookup_cache_dynamic / lookup_cache_dynamic

Draft-model tuning (active when `spec_type` is `draft`):
    draft_cache_type_k / spec_draft_cache_type_k
    draft_cache_type_v / spec_draft_cache_type_v
    draft_threads / spec_draft_threads
    draft_threads_batch / spec_draft_threads_batch
    draft_cpu_moe / spec_draft_cpu_moe (bool flag)
    draft_n_cpu_moe / spec_draft_n_cpu_moe (first N MoE layers on CPU)
    draft_override_tensor / spec_draft_override_tensor (comma-separated
    <tensor-name-pattern>=<buffer-type>; re-implements upstream's static
    parse_tensor_buffer_overrides since it isn't exported)

`spec_type` already accepted comma-separated lists after the previous commit,
matching upstream's `common_speculative_types_from_names`.

Docs: refresh `docs/content/advanced/model-configuration.md` with per-family
tables and a note about multi-type chaining.

Builds locally with `make docker-build-llama-cpp` (linux/amd64 cpu-llama-cpp
AVX variant).

Signed-off-by: Ettore Di Giacinto
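For illustration, a config exercising a few of the new keys could look like
this (a sketch, not taken from the repo: the model name and the chosen values
are made up; the keys and the `key:value` option format are the documented
ones):

    # hypothetical model config
    name: spec-demo
    backend: llama-cpp
    options:
      # chain modified n-gram speculation with the k4v map, tried in order
      - spec_type:ngram_mod,ngram_map_k4v
      - spec_ngram_mod_n_min:48
      - spec_ngram_mod_n_max:64
      - spec_ngram_map_k4v_min_hits:1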
* fix(turboquant): bridge new llama.cpp spec API to the legacy fork layout

The previous commits in this series adapted
backend/cpp/llama-cpp/grpc-server.cpp to the post-#22838 (parallel drafting)
llama.cpp API. The turboquant build reuses the same grpc-server.cpp through
backend/cpp/turboquant/Makefile, which copies it into
turboquant-<flavor>-build/ and runs patch-grpc-server.sh on the copy. The
fork branched before the API refactor, so it errors out on:

* `ctx_server.impl->model_tgt` (the fork still has `model`)
* `params.speculative.{ngram_mod,ngram_map_k,ngram_map_k4v,ngram_cache}.*`
  (none of these sub-structs exist in the fork)
* `params.speculative.draft.{cache_type_k/v, cpuparams[_batch].n_threads,
  tensor_buft_overrides}` (the fork uses the pre-#22397 flat layout)
* the `params.speculative.types` vector / `common_speculative_types_from_names`
  (the fork has a scalar `type` and only the singular helper)

Approach:

1. backend/cpp/llama-cpp/grpc-server.cpp: introduce a single feature switch,
   `LOCALAI_LEGACY_LLAMA_CPP_SPEC`. When defined, the two
   `speculative.type[s]` discriminations (the "default to draft when a draft
   model is set" branch and the `spec_type` / `speculative_type` option
   parser) fall back to the singular scalar form, and the entire new-option
   block (ngram_mod / map_k / map_k4v / ngram_cache /
   draft.{cache_type_*, cpuparams*, tensor_buft_overrides}) is preprocessed
   out. The macro is *not* defined in the source tree — stock llama-cpp
   builds get the full new API.
2. backend/cpp/turboquant/patch-grpc-server.sh: two new patch steps applied
   to the per-flavor build copy at turboquant-<flavor>-build/grpc-server.cpp:
   - substitute `ctx_server.impl->model_tgt` -> `ctx_server.impl->model`
   - inject `#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1` before the first
     `#include`, so the guarded blocks above drop out for the fork build.

Both patches are idempotent and follow the existing sed/awk pattern in this
script (KV cache types, `get_media_marker`, flat speculative renames). Stock
llama-cpp's `grpc-server.cpp` is never touched.

Drop both legacy patches once the turboquant fork rebases past
ggml-org/llama.cpp#22397 / #22838.

Signed-off-by: Ettore Di Giacinto

* fix(turboquant): close draft_ctx_size brace inside legacy guard

The previous turboquant fix wrapped the new option-handler blocks in
`#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC ... #endif` but placed the guard in
the middle of an `else if` chain — the `} else if` openings of the new blocks
were responsible for closing the previous block's brace. With the macro
defined, the new blocks vanish, draft_ctx_size's `{` loses its closer, the
for-loop's `}` is consumed instead, and the file ends with a stray opening
brace — clang reports it as `function-definition is not allowed here before
'{'` on the next top-level `int main(...)` and `expected '}' at end of
input`.

Move the chain split inside the draft_ctx_size branch:

    } else if (... "draft_ctx_size") {
        // ...
    #ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
    }                                          // legacy: chain ends here
    #else
    } else if (... "spec_ngram_mod_n_min") {   // modern: chain continues
        ...
    } else if (... "draft_override_tensor") {
        ...
    }                                          // closes last branch
    #endif
    }                                          // closes for-loop

Brace count is now balanced under both preprocessor branches (verified with
`tr -cd '{' | wc -c` against the patched and unpatched outputs). Local
`make docker-build-turboquant` builds the linux/amd64 cpu-llama-cpp
`turboquant-avx` variant cleanly.

Signed-off-by: Ettore Di Giacinto
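As a side note to the brace check above, the count can also be taken per
preprocessor branch instead of against patched vs. unpatched outputs; a rough
sketch with unifdef(1) (assumed installed; the flavor in the path is
illustrative):

    # resolve only the legacy switch, each way, then compare brace counts
    SRC=turboquant-avx-build/grpc-server.cpp
    for flag in -D -U; do
        unifdef ${flag}LOCALAI_LEGACY_LLAMA_CPP_SPEC "$SRC" > /tmp/spec-branch.cpp
        echo "$flag: open=$(tr -cd '{' < /tmp/spec-branch.cpp | wc -c)" \
             "close=$(tr -cd '}' < /tmp/spec-branch.cpp | wc -c)"
    done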
* fix(ci): forward AMDGPU_TARGETS into Dockerfile.turboquant builder-prebuilt

Dockerfile.turboquant's `builder-prebuilt` stage was missing the
`ARG AMDGPU_TARGETS` / `ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}` pair that
`builder-fromsource` already has (and that `Dockerfile.llama-cpp` mirrors
across both stages). When CI uses the prebuilt base image
(quay.io/go-skynet/ci-cache:base-grpc-*, the common path), the build-arg
passed by the workflow never reaches the env inside the compile stage.
backend/cpp/llama-cpp/Makefile:38 (introduced by #9626) errors out on hipblas
builds when AMDGPU_TARGETS is empty, and the turboquant Makefile reuses
backend/cpp/llama-cpp via a sibling build dir, so the same check fires from
turboquant-fallback under BUILD_TYPE=hipblas:

    Makefile:38: *** AMDGPU_TARGETS is empty — set it to a comma-separated list of gfx targets e.g. gfx1100,gfx1101. Stop.
    make: *** [Makefile:66: turboquant-fallback] Error 2

The bug is latent on master because the docker layer cache stays warm across
builds — the compile step rarely re-runs from scratch. The llama.cpp bump in
this PR invalidates the cache, so the missing env var becomes load-bearing
and the hipblas turboquant CI job fails.

Mirror the existing pattern from Dockerfile.llama-cpp.

Signed-off-by: Ettore Di Giacinto

---------

Signed-off-by: Ettore Di Giacinto
Co-authored-by: Ettore Di Giacinto
---
 backend/Dockerfile.turboquant                |   6 +
 backend/cpp/llama-cpp/Makefile               |   2 +-
 backend/cpp/llama-cpp/grpc-server.cpp        | 198 ++++++++++++++++++-
 backend/cpp/turboquant/patch-grpc-server.sh  |  43 ++++
 docs/content/advanced/model-configuration.md |  58 +++++-
 5 files changed, 296 insertions(+), 11 deletions(-)

diff --git a/backend/Dockerfile.turboquant b/backend/Dockerfile.turboquant
index ffdccf416..4f3a005ee 100644
--- a/backend/Dockerfile.turboquant
+++ b/backend/Dockerfile.turboquant
@@ -117,6 +117,12 @@ ARG CUDA_DOCKER_ARCH
 ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
 ARG CMAKE_ARGS
 ENV CMAKE_ARGS=${CMAKE_ARGS}
+# AMDGPU_TARGETS must be forwarded into the env here too — backend/cpp/llama-cpp/Makefile
+# (which the turboquant Makefile reuses via a sibling build dir) errors out when the var
+# is empty on a hipblas build, and the prebuilt path is what CI exercises most of the
+# time. The builder-fromsource stage above already does this; mirror it here.
+ARG AMDGPU_TARGETS
+ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}
 ARG TARGETARCH
 ARG TARGETVARIANT

diff --git a/backend/cpp/llama-cpp/Makefile b/backend/cpp/llama-cpp/Makefile
index 62e2b0fb3..cd9a71860 100644
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,5 +1,5 @@

-LLAMA_VERSION?=389ff61d77b5c71cec0cf92fe4e5d01ace80b797
+LLAMA_VERSION?=1ec7ba0c14f33f17e980daeeda5f35b225d41994
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp

 CMAKE_ARGS?=

diff --git a/backend/cpp/llama-cpp/grpc-server.cpp b/backend/cpp/llama-cpp/grpc-server.cpp
index df3d075e7..5c1d07766 100644
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -36,6 +36,8 @@
 #include
 #include
 #include
+#include <list>
+#include <map>
 #include
 #include
 #include
@@ -443,10 +445,22 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
     // Draft model for speculative decoding
     if (!request->draftmodel().empty()) {
         params.speculative.draft.mparams.path = request->draftmodel();
-        // Default to draft type if a draft model is set but no explicit type
+        // Default to draft type if a draft model is set but no explicit type.
+        // Upstream (post ggml-org/llama.cpp#22838) made the speculative type a
+        // vector; the turboquant fork still uses the legacy scalar. The
+        // LOCALAI_LEGACY_LLAMA_CPP_SPEC macro is injected by
+        // backend/cpp/turboquant/patch-grpc-server.sh for fork builds only.
+#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
         if (params.speculative.type == COMMON_SPECULATIVE_TYPE_NONE) {
             params.speculative.type = COMMON_SPECULATIVE_TYPE_DRAFT;
         }
+#else
+        const bool no_spec_type = params.speculative.types.empty() ||
+            (params.speculative.types.size() == 1 &&
+             params.speculative.types[0] == COMMON_SPECULATIVE_TYPE_NONE);
+        if (no_spec_type) {
+            params.speculative.types = { COMMON_SPECULATIVE_TYPE_DRAFT };
+        }
+#endif
     }
     // params.model_alias ??
@@ -673,10 +687,35 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
             }
         // Speculative decoding options
         } else if (!strcmp(optname, "spec_type") || !strcmp(optname, "speculative_type")) {
-            auto type = common_speculative_type_from_name(optval_str);
+#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
+            // Fork only knows a single scalar `type`. Take the first comma-
+            // separated value and assign it via the singular helper.
+            std::string first = optval_str;
+            const auto comma = first.find(',');
+            if (comma != std::string::npos) first = first.substr(0, comma);
+            auto type = common_speculative_type_from_name(first);
             if (type != COMMON_SPECULATIVE_TYPE_COUNT) {
                 params.speculative.type = type;
             }
+#else
+            // Upstream switched to a vector of types (comma-separated for
+            // multi-type chaining via common_speculative_types_from_names).
+            // We keep accepting a single value here, but also tolerate
+            // comma-separated lists.
+            std::vector<std::string> names;
+            std::string item;
+            for (char c : optval_str) {
+                if (c == ',') {
+                    if (!item.empty()) { names.push_back(item); item.clear(); }
+                } else {
+                    item.push_back(c);
+                }
+            }
+            if (!item.empty()) names.push_back(item);
+            auto parsed = common_speculative_types_from_names(names);
+            if (!parsed.empty()) {
+                params.speculative.types = parsed;
+            }
+#endif
         } else if (!strcmp(optname, "spec_n_max") || !strcmp(optname, "draft_max")) {
             if (optval != NULL) {
                 try { params.speculative.draft.n_max = std::stoi(optval_str); } catch (...) {}
             }
@@ -710,10 +749,155 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
                 try { params.speculative.draft.n_gpu_layers = std::stoi(optval_str); } catch (...) {}
             }
         } else if (!strcmp(optname, "draft_ctx_size")) {
-            if (optval != NULL) {
-                try { params.speculative.draft.n_ctx = std::stoi(optval_str); } catch (...) {}
-            }
+            // The draft context size is no longer a separate field upstream:
+            // the draft shares the target context size. Accept the option for
+            // backward compatibility but silently ignore it.
+
+// Everything below relies on the struct shape introduced in ggml-org/llama.cpp#22838
+// (parallel drafting): `ngram_mod`, `ngram_map_k`, `ngram_map_k4v`,
+// `ngram_cache`, and the `draft.{cache_type_*, cpuparams*, tensor_buft_overrides}`
+// fields. The turboquant fork branched before that, so its build defines
+// LOCALAI_LEGACY_LLAMA_CPP_SPEC via patch-grpc-server.sh and these option
+// keys become unrecognized (silently dropped, like any unknown opt) for it.
+//
+// The `#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC` / `#else` split below sits at the
+// closing-brace position of the `draft_ctx_size` branch on purpose: in the
+// legacy build the chain ends here (the brace closes draft_ctx_size), and in
+// the modern build the chain continues with `} else if (...)` instead, so the
+// brace count stays balanced under both branches of the preprocessor.
+#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
+        }
+#else
+        // --- ngram_mod family (upstream --spec-ngram-mod-*) ---
+        } else if (!strcmp(optname, "spec_ngram_mod_n_min")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_mod.n_min = std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_mod_n_max")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_mod.n_max = std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_mod_n_match")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_mod.n_match = std::stoi(optval_str); } catch (...) {}
+            }
+
+        // --- ngram_map_k family (upstream --spec-ngram-map-k-*) ---
+        } else if (!strcmp(optname, "spec_ngram_map_k_size_n")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k.size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_map_k_size_m")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k.size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_map_k_min_hits")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k.min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+
+        // --- ngram_map_k4v family (upstream --spec-ngram-map-k4v-*) ---
+        } else if (!strcmp(optname, "spec_ngram_map_k4v_size_n")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k4v.size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_map_k4v_size_m")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k4v.size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_map_k4v_min_hits")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k4v.min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+
+        // --- ngram lookup caches (upstream --lookup-cache-static / -dynamic) ---
+        } else if (!strcmp(optname, "spec_lookup_cache_static") || !strcmp(optname, "lookup_cache_static")) {
+            params.speculative.ngram_cache.lookup_cache_static = optval_str;
+        } else if (!strcmp(optname, "spec_lookup_cache_dynamic") || !strcmp(optname, "lookup_cache_dynamic")) {
+            params.speculative.ngram_cache.lookup_cache_dynamic = optval_str;
+
+        // --- draft model KV cache types (upstream --spec-draft-type-k / -v) ---
+        } else if (!strcmp(optname, "draft_cache_type_k") || !strcmp(optname, "spec_draft_cache_type_k")) {
+            params.speculative.draft.cache_type_k = kv_cache_type_from_str(optval_str);
+        } else if (!strcmp(optname, "draft_cache_type_v") || !strcmp(optname, "spec_draft_cache_type_v")) {
+            params.speculative.draft.cache_type_v = kv_cache_type_from_str(optval_str);
+
+        // --- draft model thread counts (upstream --spec-draft-threads / -batch) ---
+        } else if (!strcmp(optname, "draft_threads") || !strcmp(optname, "spec_draft_threads")) {
+            if (optval != NULL) {
+                try {
+                    int n = std::stoi(optval_str);
+                    if (n <= 0) n = (int)std::thread::hardware_concurrency();
+                    params.speculative.draft.cpuparams.n_threads = n;
+                } catch (...) {}
+            }
+        } else if (!strcmp(optname, "draft_threads_batch") || !strcmp(optname, "spec_draft_threads_batch")) {
+            if (optval != NULL) {
+                try {
+                    int n = std::stoi(optval_str);
+                    if (n <= 0) n = (int)std::thread::hardware_concurrency();
+                    params.speculative.draft.cpuparams_batch.n_threads = n;
+                } catch (...) {}
+            }
+
+        // --- draft model MoE on CPU (upstream --spec-draft-cpu-moe / --spec-draft-n-cpu-moe) ---
+        } else if (!strcmp(optname, "draft_cpu_moe") || !strcmp(optname, "spec_draft_cpu_moe")) {
+            // Bool-style flag: optval may be missing; "true"/"1"/"yes" enables.
+            const bool enable = (optval == NULL) ||
+                optval_str == "true" || optval_str == "1" || optval_str == "yes" ||
+                optval_str == "on" || optval_str == "enabled";
+            if (enable) {
+                params.speculative.draft.tensor_buft_overrides.push_back(llm_ffn_exps_cpu_override());
+            }
+        } else if (!strcmp(optname, "draft_n_cpu_moe") || !strcmp(optname, "spec_draft_n_cpu_moe")) {
+            if (optval != NULL) {
+                try {
+                    int n = std::stoi(optval_str);
+                    if (n < 0) n = 0;
+                    // Keep override-name storage alive for the lifetime of the params struct
+                    // (mirrors upstream arg.cpp behavior with a function-local static).
+                    static std::list<std::string> buft_overrides_draft;
+                    for (int i = 0; i < n; ++i) {
+                        buft_overrides_draft.push_back(llm_ffn_exps_block_regex(i));
+                        params.speculative.draft.tensor_buft_overrides.push_back(
+                            {buft_overrides_draft.back().c_str(), ggml_backend_cpu_buffer_type()});
+                    }
+                } catch (...) {}
+            }
+
+        // --- draft model tensor buffer overrides (upstream --spec-draft-override-tensor) ---
+        } else if (!strcmp(optname, "draft_override_tensor") || !strcmp(optname, "spec_draft_override_tensor")) {
+            // Format: <tensor-name-pattern>=<buffer-type>,...
+            // We replicate upstream's parse_tensor_buffer_overrides (static in arg.cpp).
+            ggml_backend_load_all();
+            std::map<std::string, ggml_backend_buffer_type_t> buft_list;
+            for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
+                auto * dev = ggml_backend_dev_get(i);
+                auto * buft = ggml_backend_dev_buffer_type(dev);
+                if (buft) {
+                    buft_list[ggml_backend_buft_name(buft)] = buft;
+                }
+            }
+            static std::list<std::string> draft_override_names;
+            std::string cur;
+            auto flush = [&](const std::string & spec) {
+                auto pos = spec.find('=');
+                if (pos == std::string::npos) return;
+                const std::string name = spec.substr(0, pos);
+                const std::string type = spec.substr(pos + 1);
+                auto it = buft_list.find(type);
+                if (it == buft_list.end()) return; // unknown buffer type: ignore
+                draft_override_names.push_back(name);
+                params.speculative.draft.tensor_buft_overrides.push_back(
+                    {draft_override_names.back().c_str(), it->second});
+            };
+            for (char c : optval_str) {
+                if (c == ',') { if (!cur.empty()) { flush(cur); cur.clear(); } }
+                else { cur.push_back(c); }
+            }
+            if (!cur.empty()) flush(cur);
+        }
+#endif // LOCALAI_LEGACY_LLAMA_CPP_SPEC — closes the `else`/`#ifdef` opened at draft_ctx_size
     }

     // Set params.n_parallel from environment variable if not set via options (fallback)
@@ -2704,7 +2888,7 @@ public:
         tasks.reserve(documents.size());
         for (size_t i = 0; i < documents.size(); i++) {
-            auto tmp = format_prompt_rerank(ctx_server.impl->model, ctx_server.impl->vocab, ctx_server.impl->mctx, request->query(), documents[i]);
+            auto tmp = format_prompt_rerank(ctx_server.impl->model_tgt, ctx_server.impl->vocab, ctx_server.impl->mctx, request->query(), documents[i]);
             server_task task = server_task(SERVER_TASK_TYPE_RERANK);
             task.id = rd.queue_tasks.get_new_id();
             task.index = i;
@@ -2882,7 +3066,7 @@ public:
         // Get template source and reconstruct a common_chat_template for analysis
         std::string tmpl_src = common_chat_templates_source(ctx_server.impl->chat_params.tmpls.get());
         if (!tmpl_src.empty()) {
-            const auto * vocab = llama_model_get_vocab(ctx_server.impl->model);
+            const auto * vocab = llama_model_get_vocab(ctx_server.impl->model_tgt);
             std::string token_bos, token_eos;
             if (vocab) {
                 auto bos_id = llama_vocab_bos(vocab);
diff --git a/backend/cpp/turboquant/patch-grpc-server.sh b/backend/cpp/turboquant/patch-grpc-server.sh
index a4c2df62c..d071c6156 100755
--- a/backend/cpp/turboquant/patch-grpc-server.sh
+++ b/backend/cpp/turboquant/patch-grpc-server.sh
@@ -108,4 +108,47 @@ else
     echo "==> $SRC has no post-#22397 speculative field refs, skipping spec rename patch"
 fi

+# 4. Revert the `ctx_server.impl->model_tgt` rename introduced by upstream
+#    ggml-org/llama.cpp#22838 (parallel drafting). The turboquant fork still
+#    exposes the field as `model` on `server_context_impl`. The two call sites
+#    are in the Rerank and ModelMetadata RPC handlers.
+if grep -q 'ctx_server\.impl->model_tgt' "$SRC"; then
+    echo "==> patching $SRC to revert ctx_server.impl->model_tgt -> ctx_server.impl->model"
+    sed -E 's/ctx_server\.impl->model_tgt/ctx_server.impl->model/g' "$SRC" > "$SRC.tmp"
+    mv "$SRC.tmp" "$SRC"
+    echo "==> model_tgt rename OK"
+else
+    echo "==> $SRC has no ctx_server.impl->model_tgt refs, skipping model_tgt rename patch"
+fi
+
+# 5. Define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top of the file so the
+#    grpc-server option parser skips the new option-handler blocks (ngram_mod,
+#    ngram_map_k, ngram_map_k4v, ngram_cache, draft.cache_type_*, draft.cpuparams*,
+#    draft.tensor_buft_overrides) introduced for the post-#22838 layout. Those
+#    blocks reference struct fields that simply do not exist in the fork.
+if grep -q '^#define LOCALAI_LEGACY_LLAMA_CPP_SPEC' "$SRC"; then
+    echo "==> $SRC already defines LOCALAI_LEGACY_LLAMA_CPP_SPEC, skipping"
+else
+    echo "==> patching $SRC to define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top"
+    # Insert the define before the very first `#include` so it precedes all the
+    # speculative-decoding code paths.
+    awk '
+        !done && /^#include/ {
+            print "#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1"
+            print "// ^ injected by backend/cpp/turboquant/patch-grpc-server.sh"
+            print ""
+            done = 1
+        }
+        { print }
+        END {
+            if (!done) {
+                print "patch-grpc-server.sh: no #include anchor found to insert LOCALAI_LEGACY_LLAMA_CPP_SPEC" > "/dev/stderr"
+                exit 1
+            }
+        }
+    ' "$SRC" > "$SRC.tmp"
+    mv "$SRC.tmp" "$SRC"
+    echo "==> LOCALAI_LEGACY_LLAMA_CPP_SPEC define OK"
+fi
+
 echo "==> all patches applied"

diff --git a/docs/content/advanced/model-configuration.md b/docs/content/advanced/model-configuration.md
index 172e50b65..02aa555ce 100644
--- a/docs/content/advanced/model-configuration.md
+++ b/docs/content/advanced/model-configuration.md
@@ -251,18 +251,68 @@ options:

 These are set via the `options:` array in the model configuration (format: `key:value`):

+**Common options**
+
 | Option | Type | Default | Description |
 |--------|------|---------|-------------|
-| `spec_type` | string | `none` | Speculative decoding type (see table below) |
+| `spec_type` / `speculative_type` | string | `none` | Speculative decoding type, or comma-separated list to chain multiple (see table below) |
 | `spec_n_max` / `draft_max` | int | 16 | Maximum number of tokens to draft per step |
 | `spec_n_min` / `draft_min` | int | 0 | Minimum draft tokens required to use speculation |
 | `spec_p_min` / `draft_p_min` | float | 0.75 | Minimum probability threshold for greedy acceptance |
 | `spec_p_split` | float | 0.1 | Split probability for tree-based branching |
+
+**Draft-model options** (apply when `spec_type=draft`, i.e. a `draft_model` is configured)
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `draft_gpu_layers` | int | -1 | GPU layers for the draft model (-1 = use default) |
+| `draft_threads` / `spec_draft_threads` | int | same as main | Threads used by the draft model (`<= 0` = hardware concurrency) |
+| `draft_threads_batch` / `spec_draft_threads_batch` | int | same as `draft_threads` | Threads used by the draft model during batch / prompt processing |
+| `draft_cache_type_k` / `spec_draft_cache_type_k` | string | `f16` | KV cache K data type for the draft model (same values as `cache_type_k`) |
+| `draft_cache_type_v` / `spec_draft_cache_type_v` | string | `f16` | KV cache V data type for the draft model |
+| `draft_cpu_moe` / `spec_draft_cpu_moe` | bool | false | Keep all MoE expert weights of the draft model on CPU |
+| `draft_n_cpu_moe` / `spec_draft_n_cpu_moe` | int | 0 | Keep MoE expert weights of the first N draft-model layers on CPU |
+| `draft_override_tensor` / `spec_draft_override_tensor` | string | "" | Comma-separated `<tensor-name-pattern>=<buffer-type>` overrides for the draft model |
+| `draft_ctx_size` | int | (ignored) | Deprecated upstream: the draft now shares the target context size. Accepted for backward compatibility but has no effect. |
+
+**`ngram_simple` options** (used when `spec_type` includes `ngram_simple`)
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
 | `spec_ngram_size_n` / `ngram_size_n` | int | 12 | N-gram lookup size |
 | `spec_ngram_size_m` / `ngram_size_m` | int | 48 | M-gram proposal size |
 | `spec_ngram_min_hits` / `ngram_min_hits` | int | 1 | Minimum hits for accepting n-gram proposals |
-| `draft_gpu_layers` | int | -1 | GPU layers for the draft model (-1 = use default) |
-| `draft_ctx_size` | int | 0 | Context size for the draft model (0 = auto) |
+
+**`ngram_mod` options** (used when `spec_type` includes `ngram_mod`)
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `spec_ngram_mod_n_min` | int | 48 | Minimum number of n-gram tokens to use |
+| `spec_ngram_mod_n_max` | int | 64 | Maximum number of n-gram tokens to use |
+| `spec_ngram_mod_n_match` | int | 24 | N-gram lookup length |
+
+**`ngram_map_k` options** (used when `spec_type` includes `ngram_map_k`)
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `spec_ngram_map_k_size_n` | int | 12 | N-gram lookup size |
+| `spec_ngram_map_k_size_m` | int | 48 | M-gram proposal size |
+| `spec_ngram_map_k_min_hits` | int | 1 | Minimum hits for accepting proposals |
+
+**`ngram_map_k4v` options** (used when `spec_type` includes `ngram_map_k4v`)
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `spec_ngram_map_k4v_size_n` | int | 12 | N-gram lookup size |
+| `spec_ngram_map_k4v_size_m` | int | 48 | M-gram proposal size |
+| `spec_ngram_map_k4v_min_hits` | int | 1 | Minimum hits for accepting proposals |
+
+**`ngram_cache` lookup files**
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `spec_lookup_cache_static` / `lookup_cache_static` | string | "" | Path to a static n-gram lookup cache file |
+| `spec_lookup_cache_dynamic` / `lookup_cache_dynamic` | string | "" | Path to a dynamic n-gram lookup cache file (updated by generation) |

 #### Speculative Type Values

@@ -277,6 +327,8 @@ These are set via the `options:` array in the model configuration (format: `key:
 | `ngram_mod` | Modified n-gram speculation |
 | `ngram_cache` | 3-level n-gram cache |

+Multiple types can be chained by passing a comma-separated list to `spec_type` (e.g. `spec_type:ngram_simple,ngram_mod`). The runtime tries them in order and accepts the first proposal that meets the acceptance criteria.
+
 {{% notice note %}}
 Speculative decoding is automatically disabled when multimodal models (with `mmproj`) are active. The `n_draft` parameter can also be overridden per-request.
 {{% /notice %}}
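As a worked example of the chaining note above (a sketch: the model name is
made up, while the keys and values come from the tables in the docs diff):

```yaml
name: chained-spec
backend: llama-cpp
options:
  - spec_type:ngram_simple,ngram_mod   # tried in this order
  - spec_ngram_size_n:12               # ngram_simple knobs
  - spec_ngram_min_hits:1
  - spec_ngram_mod_n_match:24          # ngram_mod knob
```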