From bc4cd3dd85ee74c059da56a0120d2aad07465158 Mon Sep 17 00:00:00 2001
From: "LocalAI [bot]" <139863280+localai-bot@users.noreply.github.com>
Date: Tue, 12 May 2026 17:22:37 +0200
Subject: [PATCH] feat(llama-cpp): bump to `1ec7ba0c`, adapt grpc-server,
 expose new spec-decoding options (#9765)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* chore(llama.cpp): bump to 1ec7ba0c14f33f17e980daeeda5f35b225d41994

Picks up the upstream `spec : parallel drafting support` change
(ggml-org/llama.cpp#22838), which reshapes the speculative-decoding API and
`server_context_impl`. Adapt the grpc-server wrapper accordingly:

* `common_params_speculative::type` (a single enum) became `types` (a
  `std::vector`). Update both the "default to draft when a draft model is
  set" branch and the `spec_type`/`speculative_type` option parser. The
  parser now also tolerates comma-separated lists, mirroring the upstream
  `common_speculative_types_from_names` semantics.
* `common_params_speculative_draft::n_ctx` is gone (the draft now shares the
  target context size). Keep the `draft_ctx_size` option name for backward
  compatibility and ignore the value rather than failing.
* `server_context_impl::model` was renamed to `model_tgt`; update the two
  reranker / model-metadata call sites.

Replaces #9763. Builds cleanly under the linux/amd64 cpu-llama-cpp target
locally.

Signed-off-by: Ettore Di Giacinto

* feat(llama-cpp): expose new speculative-decoding option keys

Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838) adds
the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative families and
beefs up the draft-model knobs. The previous bump only adapted the API; this
commit exposes the new fields through the grpc-server options dictionary so
that model configs can drive them.

New `options:` keys (all under `backend: llama-cpp`):

ngram_mod (`ngram_mod` type):
    spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match

ngram_map_k (`ngram_map_k` type):
    spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits

ngram_map_k4v (`ngram_map_k4v` type):
    spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m / spec_ngram_map_k4v_min_hits

ngram lookup caches (`ngram_cache` type):
    spec_lookup_cache_static / lookup_cache_static
    spec_lookup_cache_dynamic / lookup_cache_dynamic

Draft-model tuning (active when `spec_type` is `draft`):
    draft_cache_type_k / spec_draft_cache_type_k
    draft_cache_type_v / spec_draft_cache_type_v
    draft_threads / spec_draft_threads
    draft_threads_batch / spec_draft_threads_batch
    draft_cpu_moe / spec_draft_cpu_moe (bool flag)
    draft_n_cpu_moe / spec_draft_n_cpu_moe (first N MoE layers on CPU)
    draft_override_tensor / spec_draft_override_tensor (comma-separated
    <tensor-name-pattern>=<buffer-type>; re-implements upstream's static
    parse_tensor_buffer_overrides since it isn't exported)

`spec_type` already accepted comma-separated lists after the previous commit,
matching upstream's `common_speculative_types_from_names`.

Docs: refresh `docs/content/advanced/model-configuration.md` with per-family
tables and a note about multi-type chaining.

Builds locally with `make docker-build-llama-cpp` (linux/amd64 cpu-llama-cpp
AVX variant).

Signed-off-by: Ettore Di Giacinto
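For illustration, a config exercising a few of the new keys could look like
this (a sketch, not taken from the repo: the model name and the chosen values
are made up; the keys and the `key:value` option format are the documented
ones):

    # hypothetical model config
    name: spec-demo
    backend: llama-cpp
    options:
      # chain modified n-gram speculation with the k4v map, tried in order
      - spec_type:ngram_mod,ngram_map_k4v
      - spec_ngram_mod_n_min:48
      - spec_ngram_mod_n_max:64
      - spec_ngram_map_k4v_min_hits:1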
* fix(turboquant): bridge new llama.cpp spec API to the legacy fork layout

The previous commits in this series adapted
backend/cpp/llama-cpp/grpc-server.cpp to the post-#22838 (parallel drafting)
llama.cpp API. The turboquant build reuses the same grpc-server.cpp through
backend/cpp/turboquant/Makefile, which copies it into
turboquant-<flavor>-build/ and runs patch-grpc-server.sh on the copy. The
fork branched before the API refactor, so it errors out on:

* `ctx_server.impl->model_tgt` (the fork still has `model`)
* `params.speculative.{ngram_mod,ngram_map_k,ngram_map_k4v,ngram_cache}.*`
  (none of these sub-structs exist in the fork)
* `params.speculative.draft.{cache_type_k/v, cpuparams[_batch].n_threads,
  tensor_buft_overrides}` (the fork uses the pre-#22397 flat layout)
* the `params.speculative.types` vector / `common_speculative_types_from_names`
  (the fork has a scalar `type` and only the singular helper)

Approach:

1. backend/cpp/llama-cpp/grpc-server.cpp: introduce a single feature switch,
   `LOCALAI_LEGACY_LLAMA_CPP_SPEC`. When defined, the two
   `speculative.type[s]` discriminations (the "default to draft when a draft
   model is set" branch and the `spec_type` / `speculative_type` option
   parser) fall back to the singular scalar form, and the entire new-option
   block (ngram_mod / map_k / map_k4v / ngram_cache /
   draft.{cache_type_*, cpuparams*, tensor_buft_overrides}) is preprocessed
   out. The macro is *not* defined in the source tree — stock llama-cpp
   builds get the full new API.
2. backend/cpp/turboquant/patch-grpc-server.sh: two new patch steps applied
   to the per-flavor build copy at turboquant-<flavor>-build/grpc-server.cpp:
   - substitute `ctx_server.impl->model_tgt` -> `ctx_server.impl->model`
   - inject `#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1` before the first
     `#include`, so the guarded blocks above drop out for the fork build.

Both patches are idempotent and follow the existing sed/awk pattern in this
script (KV cache types, `get_media_marker`, flat speculative renames). Stock
llama-cpp's `grpc-server.cpp` is never touched.

Drop both legacy patches once the turboquant fork rebases past
ggml-org/llama.cpp#22397 / #22838.

Signed-off-by: Ettore Di Giacinto

* fix(turboquant): close draft_ctx_size brace inside legacy guard

The previous turboquant fix wrapped the new option-handler blocks in
`#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC ... #endif` but placed the guard in
the middle of an `else if` chain — the `} else if` openings of the new blocks
were responsible for closing the previous block's brace. With the macro
defined, the new blocks vanish, draft_ctx_size's `{` loses its closer, the
for-loop's `}` is consumed instead, and the file ends with a stray opening
brace — clang reports it as `function-definition is not allowed here before
'{'` on the next top-level `int main(...)` and `expected '}' at end of
input`.

Move the chain split inside the draft_ctx_size branch:

    } else if (... "draft_ctx_size") {
        // ...
    #ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
    }                                          // legacy: chain ends here
    #else
    } else if (... "spec_ngram_mod_n_min") {   // modern: chain continues
        ...
    } else if (... "draft_override_tensor") {
        ...
    }                                          // closes last branch
    #endif
    }                                          // closes for-loop

Brace count is now balanced under both preprocessor branches (verified with
`tr -cd '{' | wc -c` against the patched and unpatched outputs). Local
`make docker-build-turboquant` builds the linux/amd64 cpu-llama-cpp
`turboquant-avx` variant cleanly.

Signed-off-by: Ettore Di Giacinto
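As a side note to the brace check above, the count can also be taken per
preprocessor branch instead of against patched vs. unpatched outputs; a rough
sketch with unifdef(1) (assumed installed; the flavor in the path is
illustrative):

    # resolve only the legacy switch, each way, then compare brace counts
    SRC=turboquant-avx-build/grpc-server.cpp
    for flag in -D -U; do
        unifdef ${flag}LOCALAI_LEGACY_LLAMA_CPP_SPEC "$SRC" > /tmp/spec-branch.cpp
        echo "$flag: open=$(tr -cd '{' < /tmp/spec-branch.cpp | wc -c)" \
             "close=$(tr -cd '}' < /tmp/spec-branch.cpp | wc -c)"
    done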
* fix(ci): forward AMDGPU_TARGETS into Dockerfile.turboquant builder-prebuilt

Dockerfile.turboquant's `builder-prebuilt` stage was missing the
`ARG AMDGPU_TARGETS` / `ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}` pair that
`builder-fromsource` already has (and that `Dockerfile.llama-cpp` mirrors
across both stages). When CI uses the prebuilt base image
(quay.io/go-skynet/ci-cache:base-grpc-*, the common path), the build-arg
passed by the workflow never reaches the env inside the compile stage.
backend/cpp/llama-cpp/Makefile:38 (introduced by #9626) errors out on hipblas
builds when AMDGPU_TARGETS is empty, and the turboquant Makefile reuses
backend/cpp/llama-cpp via a sibling build dir, so the same check fires from
turboquant-fallback under BUILD_TYPE=hipblas:

    Makefile:38: *** AMDGPU_TARGETS is empty — set it to a comma-separated list of gfx targets e.g. gfx1100,gfx1101. Stop.
    make: *** [Makefile:66: turboquant-fallback] Error 2

The bug is latent on master because the docker layer cache stays warm across
builds — the compile step rarely re-runs from scratch. The llama.cpp bump in
this PR invalidates the cache, so the missing env var becomes load-bearing
and the hipblas turboquant CI job fails.

Mirror the existing pattern from Dockerfile.llama-cpp.

Signed-off-by: Ettore Di Giacinto

---------

Signed-off-by: Ettore Di Giacinto
Co-authored-by: Ettore Di Giacinto
---
 backend/Dockerfile.turboquant                |   6 +
 backend/cpp/llama-cpp/Makefile               |   2 +-
 backend/cpp/llama-cpp/grpc-server.cpp        | 198 ++++++++++++++++++-
 backend/cpp/turboquant/patch-grpc-server.sh  |  43 ++++
 docs/content/advanced/model-configuration.md |  58 +++++-
 5 files changed, 296 insertions(+), 11 deletions(-)

diff --git a/backend/Dockerfile.turboquant b/backend/Dockerfile.turboquant
index ffdccf416..4f3a005ee 100644
--- a/backend/Dockerfile.turboquant
+++ b/backend/Dockerfile.turboquant
@@ -117,6 +117,12 @@ ARG CUDA_DOCKER_ARCH
 ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
 ARG CMAKE_ARGS
 ENV CMAKE_ARGS=${CMAKE_ARGS}
+# AMDGPU_TARGETS must be forwarded into the env here too — backend/cpp/llama-cpp/Makefile
+# (which the turboquant Makefile reuses via a sibling build dir) errors out when the var
+# is empty on a hipblas build, and the prebuilt path is what CI exercises most of the
+# time. The builder-fromsource stage above already does this; mirror it here.
+ARG AMDGPU_TARGETS
+ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}
 ARG TARGETARCH
 ARG TARGETVARIANT

diff --git a/backend/cpp/llama-cpp/Makefile b/backend/cpp/llama-cpp/Makefile
index 62e2b0fb3..cd9a71860 100644
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,5 +1,5 @@

-LLAMA_VERSION?=389ff61d77b5c71cec0cf92fe4e5d01ace80b797
+LLAMA_VERSION?=1ec7ba0c14f33f17e980daeeda5f35b225d41994
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp

 CMAKE_ARGS?=

diff --git a/backend/cpp/llama-cpp/grpc-server.cpp b/backend/cpp/llama-cpp/grpc-server.cpp
index df3d075e7..5c1d07766 100644
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -36,6 +36,8 @@
 #include
 #include
 #include
+#include <list>
+#include <map>
 #include
 #include
 #include
@@ -443,10 +445,22 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
     // Draft model for speculative decoding
     if (!request->draftmodel().empty()) {
         params.speculative.draft.mparams.path = request->draftmodel();
-        // Default to draft type if a draft model is set but no explicit type
+        // Default to draft type if a draft model is set but no explicit type.
+        // Upstream (post ggml-org/llama.cpp#22838) made the speculative type a
+        // vector; the turboquant fork still uses the legacy scalar. The
+        // LOCALAI_LEGACY_LLAMA_CPP_SPEC macro is injected by
+        // backend/cpp/turboquant/patch-grpc-server.sh for fork builds only.
+#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
         if (params.speculative.type == COMMON_SPECULATIVE_TYPE_NONE) {
             params.speculative.type = COMMON_SPECULATIVE_TYPE_DRAFT;
         }
+#else
+        const bool no_spec_type = params.speculative.types.empty() ||
+            (params.speculative.types.size() == 1 &&
+             params.speculative.types[0] == COMMON_SPECULATIVE_TYPE_NONE);
+        if (no_spec_type) {
+            params.speculative.types = { COMMON_SPECULATIVE_TYPE_DRAFT };
+        }
+#endif
     }
     // params.model_alias ??
@@ -673,10 +687,35 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
             }
         // Speculative decoding options
         } else if (!strcmp(optname, "spec_type") || !strcmp(optname, "speculative_type")) {
-            auto type = common_speculative_type_from_name(optval_str);
+#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
+            // Fork only knows a single scalar `type`. Take the first comma-
+            // separated value and assign it via the singular helper.
+            std::string first = optval_str;
+            const auto comma = first.find(',');
+            if (comma != std::string::npos) first = first.substr(0, comma);
+            auto type = common_speculative_type_from_name(first);
             if (type != COMMON_SPECULATIVE_TYPE_COUNT) {
                 params.speculative.type = type;
             }
+#else
+            // Upstream switched to a vector of types (comma-separated for
+            // multi-type chaining via common_speculative_types_from_names).
+            // We keep accepting a single value here, but also tolerate
+            // comma-separated lists.
+            std::vector<std::string> names;
+            std::string item;
+            for (char c : optval_str) {
+                if (c == ',') {
+                    if (!item.empty()) { names.push_back(item); item.clear(); }
+                } else {
+                    item.push_back(c);
+                }
+            }
+            if (!item.empty()) names.push_back(item);
+            auto parsed = common_speculative_types_from_names(names);
+            if (!parsed.empty()) {
+                params.speculative.types = parsed;
+            }
+#endif
         } else if (!strcmp(optname, "spec_n_max") || !strcmp(optname, "draft_max")) {
             if (optval != NULL) {
                 try { params.speculative.draft.n_max = std::stoi(optval_str); } catch (...) {}
             }
@@ -710,10 +749,155 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
                 try { params.speculative.draft.n_gpu_layers = std::stoi(optval_str); } catch (...) {}
             }
         } else if (!strcmp(optname, "draft_ctx_size")) {
-            if (optval != NULL) {
-                try { params.speculative.draft.n_ctx = std::stoi(optval_str); } catch (...) {}
-            }
+            // The draft context size is no longer a separate field upstream:
+            // the draft shares the target context size. Accept the option for
+            // backward compatibility but silently ignore it.
+
+// Everything below relies on the struct shape introduced in ggml-org/llama.cpp#22838
+// (parallel drafting): `ngram_mod`, `ngram_map_k`, `ngram_map_k4v`,
+// `ngram_cache`, and the `draft.{cache_type_*, cpuparams*, tensor_buft_overrides}`
+// fields. The turboquant fork branched before that, so its build defines
+// LOCALAI_LEGACY_LLAMA_CPP_SPEC via patch-grpc-server.sh and these option
+// keys become unrecognized (silently dropped, like any unknown opt) for it.
+//
+// The `#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC` / `#else` split below sits at the
+// closing-brace position of the `draft_ctx_size` branch on purpose: in the
+// legacy build the chain ends here (the brace closes draft_ctx_size), and in
+// the modern build the chain continues with `} else if (...)` instead, so the
+// brace count stays balanced under both branches of the preprocessor.
+#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
+        }
+#else
+        // --- ngram_mod family (upstream --spec-ngram-mod-*) ---
+        } else if (!strcmp(optname, "spec_ngram_mod_n_min")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_mod.n_min = std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_mod_n_max")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_mod.n_max = std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_mod_n_match")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_mod.n_match = std::stoi(optval_str); } catch (...) {}
+            }
+
+        // --- ngram_map_k family (upstream --spec-ngram-map-k-*) ---
+        } else if (!strcmp(optname, "spec_ngram_map_k_size_n")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k.size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_map_k_size_m")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k.size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_map_k_min_hits")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k.min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+
+        // --- ngram_map_k4v family (upstream --spec-ngram-map-k4v-*) ---
+        } else if (!strcmp(optname, "spec_ngram_map_k4v_size_n")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k4v.size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_map_k4v_size_m")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k4v.size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+        } else if (!strcmp(optname, "spec_ngram_map_k4v_min_hits")) {
+            if (optval != NULL) {
+                try { params.speculative.ngram_map_k4v.min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
+            }
+
+        // --- ngram lookup caches (upstream --lookup-cache-static / -dynamic) ---
+        } else if (!strcmp(optname, "spec_lookup_cache_static") || !strcmp(optname, "lookup_cache_static")) {
+            params.speculative.ngram_cache.lookup_cache_static = optval_str;
+        } else if (!strcmp(optname, "spec_lookup_cache_dynamic") || !strcmp(optname, "lookup_cache_dynamic")) {
+            params.speculative.ngram_cache.lookup_cache_dynamic = optval_str;
+
+        // --- draft model KV cache types (upstream --spec-draft-type-k / -v) ---
+        } else if (!strcmp(optname, "draft_cache_type_k") || !strcmp(optname, "spec_draft_cache_type_k")) {
+            params.speculative.draft.cache_type_k = kv_cache_type_from_str(optval_str);
+        } else if (!strcmp(optname, "draft_cache_type_v") || !strcmp(optname, "spec_draft_cache_type_v")) {
+            params.speculative.draft.cache_type_v = kv_cache_type_from_str(optval_str);
+
+        // --- draft model thread counts (upstream --spec-draft-threads / -batch) ---
+        } else if (!strcmp(optname, "draft_threads") || !strcmp(optname, "spec_draft_threads")) {
+            if (optval != NULL) {
+                try {
+                    int n = std::stoi(optval_str);
+                    if (n <= 0) n = (int)std::thread::hardware_concurrency();
+                    params.speculative.draft.cpuparams.n_threads = n;
+                } catch (...) {}
+            }
+        } else if (!strcmp(optname, "draft_threads_batch") || !strcmp(optname, "spec_draft_threads_batch")) {
+            if (optval != NULL) {
+                try {
+                    int n = std::stoi(optval_str);
+                    if (n <= 0) n = (int)std::thread::hardware_concurrency();
+                    params.speculative.draft.cpuparams_batch.n_threads = n;
+                } catch (...) {}
+            }
+
+        // --- draft model MoE on CPU (upstream --spec-draft-cpu-moe / --spec-draft-n-cpu-moe) ---
+        } else if (!strcmp(optname, "draft_cpu_moe") || !strcmp(optname, "spec_draft_cpu_moe")) {
+            // Bool-style flag: optval may be missing; "true"/"1"/"yes" enables.
+            const bool enable = (optval == NULL) ||
+                optval_str == "true" || optval_str == "1" || optval_str == "yes" ||
+                optval_str == "on" || optval_str == "enabled";
+            if (enable) {
+                params.speculative.draft.tensor_buft_overrides.push_back(llm_ffn_exps_cpu_override());
+            }
+        } else if (!strcmp(optname, "draft_n_cpu_moe") || !strcmp(optname, "spec_draft_n_cpu_moe")) {
+            if (optval != NULL) {
+                try {
+                    int n = std::stoi(optval_str);
+                    if (n < 0) n = 0;
+                    // Keep override-name storage alive for the lifetime of the params struct
+                    // (mirrors upstream arg.cpp behavior with a function-local static).
+                    static std::list<std::string> buft_overrides_draft;
+                    for (int i = 0; i < n; ++i) {
+                        buft_overrides_draft.push_back(llm_ffn_exps_block_regex(i));
+                        params.speculative.draft.tensor_buft_overrides.push_back(
+                            {buft_overrides_draft.back().c_str(), ggml_backend_cpu_buffer_type()});
+                    }
+                } catch (...) {}
+            }
+
+        // --- draft model tensor buffer overrides (upstream --spec-draft-override-tensor) ---
+        } else if (!strcmp(optname, "draft_override_tensor") || !strcmp(optname, "spec_draft_override_tensor")) {
+            // Format: <tensor-name-pattern>=<buffer-type>,...
+            // We replicate upstream's parse_tensor_buffer_overrides (static in arg.cpp).
+            ggml_backend_load_all();
+            std::map<std::string, ggml_backend_buffer_type_t> buft_list;
+            for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
+                auto * dev = ggml_backend_dev_get(i);
+                auto * buft = ggml_backend_dev_buffer_type(dev);
+                if (buft) {
+                    buft_list[ggml_backend_buft_name(buft)] = buft;
+                }
+            }
+            static std::list<std::string> draft_override_names;
+            std::string cur;
+            auto flush = [&](const std::string & spec) {
+                auto pos = spec.find('=');
+                if (pos == std::string::npos) return;
+                const std::string name = spec.substr(0, pos);
+                const std::string type = spec.substr(pos + 1);
+                auto it = buft_list.find(type);
+                if (it == buft_list.end()) return; // unknown buffer type: ignore
+                draft_override_names.push_back(name);
+                params.speculative.draft.tensor_buft_overrides.push_back(
+                    {draft_override_names.back().c_str(), it->second});
+            };
+            for (char c : optval_str) {
+                if (c == ',') { if (!cur.empty()) { flush(cur); cur.clear(); } }
+                else { cur.push_back(c); }
+            }
+            if (!cur.empty()) flush(cur);
+        }
+#endif // LOCALAI_LEGACY_LLAMA_CPP_SPEC — closes the `else`/`#ifdef` opened at draft_ctx_size
     }

     // Set params.n_parallel from environment variable if not set via options (fallback)
@@ -2704,7 +2888,7 @@ public:
         tasks.reserve(documents.size());
         for (size_t i = 0; i < documents.size(); i++) {
-            auto tmp = format_prompt_rerank(ctx_server.impl->model, ctx_server.impl->vocab, ctx_server.impl->mctx, request->query(), documents[i]);
+            auto tmp = format_prompt_rerank(ctx_server.impl->model_tgt, ctx_server.impl->vocab, ctx_server.impl->mctx, request->query(), documents[i]);
             server_task task = server_task(SERVER_TASK_TYPE_RERANK);
             task.id = rd.queue_tasks.get_new_id();
             task.index = i;
@@ -2882,7 +3066,7 @@ public:
         // Get template source and reconstruct a common_chat_template for analysis
         std::string tmpl_src = common_chat_templates_source(ctx_server.impl->chat_params.tmpls.get());
         if (!tmpl_src.empty()) {
-            const auto * vocab = llama_model_get_vocab(ctx_server.impl->model);
+            const auto * vocab = llama_model_get_vocab(ctx_server.impl->model_tgt);
             std::string token_bos, token_eos;
             if (vocab) {
                 auto bos_id = llama_vocab_bos(vocab);
diff --git a/backend/cpp/turboquant/patch-grpc-server.sh b/backend/cpp/turboquant/patch-grpc-server.sh
index a4c2df62c..d071c6156 100755
--- a/backend/cpp/turboquant/patch-grpc-server.sh
+++ b/backend/cpp/turboquant/patch-grpc-server.sh
@@ -108,4 +108,47 @@ else
     echo "==> $SRC has no post-#22397 speculative field refs, skipping spec rename patch"
 fi

+# 4. Revert the `ctx_server.impl->model_tgt` rename introduced by upstream
+#    ggml-org/llama.cpp#22838 (parallel drafting). The turboquant fork still
+#    exposes the field as `model` on `server_context_impl`. The two call sites
+#    are in the Rerank and ModelMetadata RPC handlers.
+if grep -q 'ctx_server\.impl->model_tgt' "$SRC"; then
+    echo "==> patching $SRC to revert ctx_server.impl->model_tgt -> ctx_server.impl->model"
+    sed -E 's/ctx_server\.impl->model_tgt/ctx_server.impl->model/g' "$SRC" > "$SRC.tmp"
+    mv "$SRC.tmp" "$SRC"
+    echo "==> model_tgt rename OK"
+else
+    echo "==> $SRC has no ctx_server.impl->model_tgt refs, skipping model_tgt rename patch"
+fi
+
+# 5. Define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top of the file so the
+#    grpc-server option parser skips the new option-handler blocks (ngram_mod,
+#    ngram_map_k, ngram_map_k4v, ngram_cache, draft.cache_type_*, draft.cpuparams*,
+#    draft.tensor_buft_overrides) introduced for the post-#22838 layout. Those
+#    blocks reference struct fields that simply do not exist in the fork.
+if grep -q '^#define LOCALAI_LEGACY_LLAMA_CPP_SPEC' "$SRC"; then
+    echo "==> $SRC already defines LOCALAI_LEGACY_LLAMA_CPP_SPEC, skipping"
+else
+    echo "==> patching $SRC to define LOCALAI_LEGACY_LLAMA_CPP_SPEC at the top"
+    # Insert the define before the very first `#include` so it precedes all the
+    # speculative-decoding code paths.
+    awk '
+        !done && /^#include/ {
+            print "#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1"
+            print "// ^ injected by backend/cpp/turboquant/patch-grpc-server.sh"
+            print ""
+            done = 1
+        }
+        { print }
+        END {
+            if (!done) {
+                print "patch-grpc-server.sh: no #include anchor found to insert LOCALAI_LEGACY_LLAMA_CPP_SPEC" > "/dev/stderr"
+                exit 1
+            }
+        }
+    ' "$SRC" > "$SRC.tmp"
+    mv "$SRC.tmp" "$SRC"
+    echo "==> LOCALAI_LEGACY_LLAMA_CPP_SPEC define OK"
+fi
+
 echo "==> all patches applied"

diff --git a/docs/content/advanced/model-configuration.md b/docs/content/advanced/model-configuration.md
index 172e50b65..02aa555ce 100644
--- a/docs/content/advanced/model-configuration.md
+++ b/docs/content/advanced/model-configuration.md
@@ -251,18 +251,68 @@ options:

 These are set via the `options:` array in the model configuration (format: `key:value`):

+**Common options**
+
 | Option | Type | Default | Description |
 |--------|------|---------|-------------|
-| `spec_type` | string | `none` | Speculative decoding type (see table below) |
+| `spec_type` / `speculative_type` | string | `none` | Speculative decoding type, or comma-separated list to chain multiple (see table below) |
 | `spec_n_max` / `draft_max` | int | 16 | Maximum number of tokens to draft per step |
 | `spec_n_min` / `draft_min` | int | 0 | Minimum draft tokens required to use speculation |
 | `spec_p_min` / `draft_p_min` | float | 0.75 | Minimum probability threshold for greedy acceptance |
 | `spec_p_split` | float | 0.1 | Split probability for tree-based branching |
+
+**Draft-model options** (apply when `spec_type=draft`, i.e. a `draft_model` is configured)
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `draft_gpu_layers` | int | -1 | GPU layers for the draft model (-1 = use default) |
+| `draft_threads` / `spec_draft_threads` | int | same as main | Threads used by the draft model (`<= 0` = hardware concurrency) |
+| `draft_threads_batch` / `spec_draft_threads_batch` | int | same as `draft_threads` | Threads used by the draft model during batch / prompt processing |
+| `draft_cache_type_k` / `spec_draft_cache_type_k` | string | `f16` | KV cache K data type for the draft model (same values as `cache_type_k`) |
+| `draft_cache_type_v` / `spec_draft_cache_type_v` | string | `f16` | KV cache V data type for the draft model |
+| `draft_cpu_moe` / `spec_draft_cpu_moe` | bool | false | Keep all MoE expert weights of the draft model on CPU |
+| `draft_n_cpu_moe` / `spec_draft_n_cpu_moe` | int | 0 | Keep MoE expert weights of the first N draft-model layers on CPU |
+| `draft_override_tensor` / `spec_draft_override_tensor` | string | "" | Comma-separated `<tensor-name-pattern>=<buffer-type>` overrides for the draft model |
+| `draft_ctx_size` | int | (ignored) | Deprecated upstream: the draft now shares the target context size. Accepted for backward compatibility but has no effect. |
+
+**`ngram_simple` options** (used when `spec_type` includes `ngram_simple`)
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
 | `spec_ngram_size_n` / `ngram_size_n` | int | 12 | N-gram lookup size |
 | `spec_ngram_size_m` / `ngram_size_m` | int | 48 | M-gram proposal size |
 | `spec_ngram_min_hits` / `ngram_min_hits` | int | 1 | Minimum hits for accepting n-gram proposals |
-| `draft_gpu_layers` | int | -1 | GPU layers for the draft model (-1 = use default) |
-| `draft_ctx_size` | int | 0 | Context size for the draft model (0 = auto) |
+
+**`ngram_mod` options** (used when `spec_type` includes `ngram_mod`)
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `spec_ngram_mod_n_min` | int | 48 | Minimum number of n-gram tokens to use |
+| `spec_ngram_mod_n_max` | int | 64 | Maximum number of n-gram tokens to use |
+| `spec_ngram_mod_n_match` | int | 24 | N-gram lookup length |
+
+**`ngram_map_k` options** (used when `spec_type` includes `ngram_map_k`)
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `spec_ngram_map_k_size_n` | int | 12 | N-gram lookup size |
+| `spec_ngram_map_k_size_m` | int | 48 | M-gram proposal size |
+| `spec_ngram_map_k_min_hits` | int | 1 | Minimum hits for accepting proposals |
+
+**`ngram_map_k4v` options** (used when `spec_type` includes `ngram_map_k4v`)
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `spec_ngram_map_k4v_size_n` | int | 12 | N-gram lookup size |
+| `spec_ngram_map_k4v_size_m` | int | 48 | M-gram proposal size |
+| `spec_ngram_map_k4v_min_hits` | int | 1 | Minimum hits for accepting proposals |
+
+**`ngram_cache` lookup files**
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `spec_lookup_cache_static` / `lookup_cache_static` | string | "" | Path to a static n-gram lookup cache file |
+| `spec_lookup_cache_dynamic` / `lookup_cache_dynamic` | string | "" | Path to a dynamic n-gram lookup cache file (updated by generation) |

 #### Speculative Type Values

@@ -277,6 +327,8 @@ These are set via the `options:` array in the model configuration (format: `key:
 | `ngram_mod` | Modified n-gram speculation |
 | `ngram_cache` | 3-level n-gram cache |

+Multiple types can be chained by passing a comma-separated list to `spec_type` (e.g. `spec_type:ngram_simple,ngram_mod`). The runtime tries them in order and accepts the first proposal that meets the acceptance criteria.
+
 {{% notice note %}}
 Speculative decoding is automatically disabled when multimodal models (with `mmproj`) are active. The `n_draft` parameter can also be overridden per-request.
 {{% /notice %}}
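As a worked example of the chaining note above (a sketch: the model name is
made up, while the keys and values come from the tables in the docs diff):

```yaml
name: chained-spec
backend: llama-cpp
options:
  - spec_type:ngram_simple,ngram_mod   # tried in this order
  - spec_ngram_size_n:12               # ngram_simple knobs
  - spec_ngram_min_hits:1
  - spec_ngram_mod_n_match:24          # ngram_mod knob
```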