mirror of
https://github.com/mudler/LocalAI.git
synced 2026-05-17 04:56:52 -04:00
feat(llama-cpp): bump to 1ec7ba0c, adapt grpc-server, expose new spec-decoding options (#9765)
* chore(llama.cpp): bump to 1ec7ba0c14f33f17e980daeeda5f35b225d41994
Picks up the upstream `spec : parallel drafting support` change
(ggml-org/llama.cpp#22838) which reshapes the speculative-decoding API
and `server_context_impl`.
Adapt the grpc-server wrapper accordingly:
* `common_params_speculative::type` (single enum) became `types`
(`std::vector<common_speculative_type>`). Update both the
"default to draft when a draft model is set" branch and the
`spec_type`/`speculative_type` option parser. The parser now also
tolerates comma-separated lists, mirroring the upstream
`common_speculative_types_from_names` semantics.
* `common_params_speculative_draft::n_ctx` is gone (draft now shares
the target context size). Keep the `draft_ctx_size` option name for
backward compatibility and ignore the value rather than failing.
* `server_context_impl::model` was renamed to `model_tgt`; update the
two reranker / model-metadata call sites.
Replaces #9763. Builds cleanly under the linux/amd64 cpu-llama-cpp
target locally.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama-cpp): expose new speculative-decoding option keys
Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838)
adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative
families and beefs up the draft-model knobs. The previous bump only
adapted the API; this exposes the new fields through the grpc-server
options dictionary so model configs can drive them.
New `options:` keys (all under `backend: llama-cpp`):
ngram_mod (`ngram_mod` type):
spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match
ngram_map_k (`ngram_map_k` type):
spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits
ngram_map_k4v (`ngram_map_k4v` type):
spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m /
spec_ngram_map_k4v_min_hits
ngram lookup caches (`ngram_cache` type):
spec_lookup_cache_static / lookup_cache_static
spec_lookup_cache_dynamic / lookup_cache_dynamic
Draft-model tuning (active when `spec_type` is `draft`):
draft_cache_type_k / spec_draft_cache_type_k
draft_cache_type_v / spec_draft_cache_type_v
draft_threads / spec_draft_threads
draft_threads_batch / spec_draft_threads_batch
draft_cpu_moe / spec_draft_cpu_moe (bool flag)
draft_n_cpu_moe / spec_draft_n_cpu_moe (first N MoE layers on CPU)
draft_override_tensor / spec_draft_override_tensor
(comma-separated <tensor regex>=<buffer type>; re-implements upstream's
static parse_tensor_buffer_overrides since it isn't exported)
`spec_type` already accepted comma-separated lists after the previous
commit, matching upstream's `common_speculative_types_from_names`.
Docs: refresh `docs/content/advanced/model-configuration.md` with
per-family tables and a note about multi-type chaining.
Builds locally with `make docker-build-llama-cpp` (linux/amd64
cpu-llama-cpp AVX variant).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(turboquant): bridge new llama.cpp spec API to the legacy fork layout
The previous commits in this series adapted backend/cpp/llama-cpp/grpc-server.cpp
to the post-#22838 (parallel drafting) llama.cpp API. The turboquant build
reuses the same grpc-server.cpp through backend/cpp/turboquant/Makefile,
which copies it into turboquant-<flavor>-build/ and runs patch-grpc-server.sh
on the copy. The fork branched before the API refactor, so it errors out on:
* `ctx_server.impl->model_tgt` (fork still has `model`)
* `params.speculative.{ngram_mod,ngram_map_k,ngram_map_k4v,ngram_cache}.*`
(none of these sub-structs exist in the fork)
* `params.speculative.draft.{cache_type_k/v, cpuparams[, _batch].n_threads,
tensor_buft_overrides}` (fork uses the pre-#22397 flat layout)
* `params.speculative.types` vector / `common_speculative_types_from_names`
(fork has a scalar `type` and only the singular helper)
Approach:
1. backend/cpp/llama-cpp/grpc-server.cpp: introduce a single feature switch
`LOCALAI_LEGACY_LLAMA_CPP_SPEC`. When defined, the two `speculative.type[s]`
discriminations (the "default to draft when a draft model is set" branch
and the `spec_type` / `speculative_type` option parser) fall back to the
singular scalar form, and the entire new-option block (ngram_mod / map_k
/ map_k4v / ngram_cache / draft.{cache_type_*, cpuparams*,
tensor_buft_overrides}) is preprocessed out. The macro is *not* defined
in the source tree — stock llama-cpp builds get the full new API.
2. backend/cpp/turboquant/patch-grpc-server.sh: two new patch steps applied
to the per-flavor build copy at turboquant-<flavor>-build/grpc-server.cpp:
- substitute `ctx_server.impl->model_tgt` -> `ctx_server.impl->model`
- inject `#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1` before the first
`#include`, so the guarded blocks above drop out for the fork build.
Both patches are idempotent and follow the existing sed/awk pattern in
this script (KV cache types, `get_media_marker`, flat speculative
renames). Stock llama-cpp's `grpc-server.cpp` is never touched.
Drop both legacy patches once the turboquant fork rebases past
ggml-org/llama.cpp#22397 / #22838.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(turboquant): close draft_ctx_size brace inside legacy guard
The previous turboquant fix wrapped the new option-handler blocks in
`#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC ... #endif` but placed the guard
in the middle of an `else if` chain — the `} else if` openings of the
new blocks were responsible for closing the previous block's brace.
With the macro defined the new blocks vanish, draft_ctx_size's `{`
loses its closer, the for-loop's `}` is consumed instead, and the
file ends with a stray opening brace — clang reports it as
`function-definition is not allowed here before '{'` on the next
top-level `int main(...)` and `expected '}' at end of input`.
Move the chain split inside the draft_ctx_size branch:
} else if (... "draft_ctx_size") {
// ...
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
} // legacy: chain ends here
#else
} else if (... "spec_ngram_mod_n_min") { // modern: chain continues
...
} else if (... "draft_override_tensor") {
...
} // closes last branch
#endif
} // closes for-loop
Brace count is now balanced under both preprocessor branches (verified
with `tr -cd '{' | wc -c` against the patched and unpatched outputs).
Local `make docker-build-turboquant` builds the linux/amd64 cpu-llama-cpp
`turboquant-avx` variant cleanly.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ci): forward AMDGPU_TARGETS into Dockerfile.turboquant builder-prebuilt
Dockerfile.turboquant's `builder-prebuilt` stage was missing the
`ARG AMDGPU_TARGETS` / `ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}` pair that
`builder-fromsource` already has (and that `Dockerfile.llama-cpp`
mirrors across both stages). When CI uses the prebuilt base image
(quay.io/go-skynet/ci-cache:base-grpc-*, the common path) the build-arg
passed by the workflow never reaches the env inside the compile stage.
backend/cpp/llama-cpp/Makefile:38 (introduced by #9626) errors out on
hipblas builds when AMDGPU_TARGETS is empty, and the turboquant
Makefile reuses backend/cpp/llama-cpp via a sibling build dir, so the
same check fires from turboquant-fallback under BUILD_TYPE=hipblas:
Makefile:38: *** AMDGPU_TARGETS is empty — set it to a comma-separated
list of gfx targets e.g. gfx1100,gfx1101. Stop.
make: *** [Makefile:66: turboquant-fallback] Error 2
The bug is latent on master because the docker layer cache stays warm
across builds — the compile step rarely re-runs from scratch. The
llama.cpp bump in this PR invalidates the cache, so the missing env var
becomes load-bearing and the hipblas turboquant CI job fails.
Mirror the existing pattern from Dockerfile.llama-cpp.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
@@ -36,6 +36,8 @@
|
||||
#include <cstdlib>
|
||||
#include <fstream>
|
||||
#include <iterator>
|
||||
#include <list>
|
||||
#include <map>
|
||||
#include <mutex>
|
||||
#include <signal.h>
|
||||
#include <thread>
|
||||
@@ -443,10 +445,22 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
|
||||
// Draft model for speculative decoding
|
||||
if (!request->draftmodel().empty()) {
|
||||
params.speculative.draft.mparams.path = request->draftmodel();
|
||||
// Default to draft type if a draft model is set but no explicit type
|
||||
// Default to draft type if a draft model is set but no explicit type.
|
||||
// Upstream (post ggml-org/llama.cpp#22838) made the speculative type a
|
||||
// vector; the turboquant fork still uses the legacy scalar. The
|
||||
// LOCALAI_LEGACY_LLAMA_CPP_SPEC macro is injected by
|
||||
// backend/cpp/turboquant/patch-grpc-server.sh for fork builds only.
|
||||
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
|
||||
if (params.speculative.type == COMMON_SPECULATIVE_TYPE_NONE) {
|
||||
params.speculative.type = COMMON_SPECULATIVE_TYPE_DRAFT;
|
||||
}
|
||||
#else
|
||||
const bool no_spec_type = params.speculative.types.empty() ||
|
||||
(params.speculative.types.size() == 1 && params.speculative.types[0] == COMMON_SPECULATIVE_TYPE_NONE);
|
||||
if (no_spec_type) {
|
||||
params.speculative.types = { COMMON_SPECULATIVE_TYPE_DRAFT };
|
||||
}
|
||||
#endif
|
||||
}
|
||||
|
||||
// params.model_alias ??
|
||||
@@ -673,10 +687,35 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
|
||||
}
|
||||
// Speculative decoding options
|
||||
} else if (!strcmp(optname, "spec_type") || !strcmp(optname, "speculative_type")) {
|
||||
auto type = common_speculative_type_from_name(optval_str);
|
||||
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
|
||||
// Fork only knows a single scalar `type`. Take the first comma-
|
||||
// separated value and assign it via the singular helper.
|
||||
std::string first = optval_str;
|
||||
const auto comma = first.find(',');
|
||||
if (comma != std::string::npos) first = first.substr(0, comma);
|
||||
auto type = common_speculative_type_from_name(first);
|
||||
if (type != COMMON_SPECULATIVE_TYPE_COUNT) {
|
||||
params.speculative.type = type;
|
||||
}
|
||||
#else
|
||||
// Upstream switched to a vector of types (comma-separated for multi-type
|
||||
// chaining via common_speculative_types_from_names). We keep accepting a
|
||||
// single value here, but also tolerate comma-separated lists.
|
||||
std::vector<std::string> names;
|
||||
std::string item;
|
||||
for (char c : optval_str) {
|
||||
if (c == ',') {
|
||||
if (!item.empty()) { names.push_back(item); item.clear(); }
|
||||
} else {
|
||||
item.push_back(c);
|
||||
}
|
||||
}
|
||||
if (!item.empty()) names.push_back(item);
|
||||
auto parsed = common_speculative_types_from_names(names);
|
||||
if (!parsed.empty()) {
|
||||
params.speculative.types = parsed;
|
||||
}
|
||||
#endif
|
||||
} else if (!strcmp(optname, "spec_n_max") || !strcmp(optname, "draft_max")) {
|
||||
if (optval != NULL) {
|
||||
try { params.speculative.draft.n_max = std::stoi(optval_str); } catch (...) {}
|
||||
@@ -710,10 +749,155 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
|
||||
try { params.speculative.draft.n_gpu_layers = std::stoi(optval_str); } catch (...) {}
|
||||
}
|
||||
} else if (!strcmp(optname, "draft_ctx_size")) {
|
||||
if (optval != NULL) {
|
||||
try { params.speculative.draft.n_ctx = std::stoi(optval_str); } catch (...) {}
|
||||
}
|
||||
// The draft context size is no longer a separate field upstream: the draft
|
||||
// shares the target context size. Accept the option for backward
|
||||
// compatibility but silently ignore it.
|
||||
|
||||
// Everything below relies on struct shape introduced in ggml-org/llama.cpp#22838
|
||||
// (parallel drafting): `ngram_mod`, `ngram_map_k`, `ngram_map_k4v`,
|
||||
// `ngram_cache`, and the `draft.{cache_type_*, cpuparams*, tensor_buft_overrides}`
|
||||
// fields. The turboquant fork branched before that, so its build defines
|
||||
// LOCALAI_LEGACY_LLAMA_CPP_SPEC via patch-grpc-server.sh and these option
|
||||
// keys become unrecognized (silently dropped, like any unknown opt) for it.
|
||||
//
|
||||
// The `#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC` / `#else` split below sits at the
|
||||
// closing-brace position of the `draft_ctx_size` branch on purpose: in the
|
||||
// legacy build the chain ends here (the brace closes draft_ctx_size), and in
|
||||
// the modern build the chain continues with `} else if (...)` instead, so the
|
||||
// brace count stays balanced under both branches of the preprocessor.
|
||||
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
|
||||
}
|
||||
#else
|
||||
// --- ngram_mod family (upstream --spec-ngram-mod-*) ---
|
||||
} else if (!strcmp(optname, "spec_ngram_mod_n_min")) {
|
||||
if (optval != NULL) {
|
||||
try { params.speculative.ngram_mod.n_min = std::stoi(optval_str); } catch (...) {}
|
||||
}
|
||||
} else if (!strcmp(optname, "spec_ngram_mod_n_max")) {
|
||||
if (optval != NULL) {
|
||||
try { params.speculative.ngram_mod.n_max = std::stoi(optval_str); } catch (...) {}
|
||||
}
|
||||
} else if (!strcmp(optname, "spec_ngram_mod_n_match")) {
|
||||
if (optval != NULL) {
|
||||
try { params.speculative.ngram_mod.n_match = std::stoi(optval_str); } catch (...) {}
|
||||
}
|
||||
|
||||
// --- ngram_map_k family (upstream --spec-ngram-map-k-*) ---
|
||||
} else if (!strcmp(optname, "spec_ngram_map_k_size_n")) {
|
||||
if (optval != NULL) {
|
||||
try { params.speculative.ngram_map_k.size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
|
||||
}
|
||||
} else if (!strcmp(optname, "spec_ngram_map_k_size_m")) {
|
||||
if (optval != NULL) {
|
||||
try { params.speculative.ngram_map_k.size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
|
||||
}
|
||||
} else if (!strcmp(optname, "spec_ngram_map_k_min_hits")) {
|
||||
if (optval != NULL) {
|
||||
try { params.speculative.ngram_map_k.min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
|
||||
}
|
||||
|
||||
// --- ngram_map_k4v family (upstream --spec-ngram-map-k4v-*) ---
|
||||
} else if (!strcmp(optname, "spec_ngram_map_k4v_size_n")) {
|
||||
if (optval != NULL) {
|
||||
try { params.speculative.ngram_map_k4v.size_n = (uint16_t)std::stoi(optval_str); } catch (...) {}
|
||||
}
|
||||
} else if (!strcmp(optname, "spec_ngram_map_k4v_size_m")) {
|
||||
if (optval != NULL) {
|
||||
try { params.speculative.ngram_map_k4v.size_m = (uint16_t)std::stoi(optval_str); } catch (...) {}
|
||||
}
|
||||
} else if (!strcmp(optname, "spec_ngram_map_k4v_min_hits")) {
|
||||
if (optval != NULL) {
|
||||
try { params.speculative.ngram_map_k4v.min_hits = (uint16_t)std::stoi(optval_str); } catch (...) {}
|
||||
}
|
||||
|
||||
// --- ngram lookup caches (upstream --lookup-cache-static / -dynamic) ---
|
||||
} else if (!strcmp(optname, "spec_lookup_cache_static") || !strcmp(optname, "lookup_cache_static")) {
|
||||
params.speculative.ngram_cache.lookup_cache_static = optval_str;
|
||||
} else if (!strcmp(optname, "spec_lookup_cache_dynamic") || !strcmp(optname, "lookup_cache_dynamic")) {
|
||||
params.speculative.ngram_cache.lookup_cache_dynamic = optval_str;
|
||||
|
||||
// --- draft model KV cache types (upstream --spec-draft-type-k / -v) ---
|
||||
} else if (!strcmp(optname, "draft_cache_type_k") || !strcmp(optname, "spec_draft_cache_type_k")) {
|
||||
params.speculative.draft.cache_type_k = kv_cache_type_from_str(optval_str);
|
||||
} else if (!strcmp(optname, "draft_cache_type_v") || !strcmp(optname, "spec_draft_cache_type_v")) {
|
||||
params.speculative.draft.cache_type_v = kv_cache_type_from_str(optval_str);
|
||||
|
||||
// --- draft model thread counts (upstream --spec-draft-threads / -batch) ---
|
||||
} else if (!strcmp(optname, "draft_threads") || !strcmp(optname, "spec_draft_threads")) {
|
||||
if (optval != NULL) {
|
||||
try {
|
||||
int n = std::stoi(optval_str);
|
||||
if (n <= 0) n = (int)std::thread::hardware_concurrency();
|
||||
params.speculative.draft.cpuparams.n_threads = n;
|
||||
} catch (...) {}
|
||||
}
|
||||
} else if (!strcmp(optname, "draft_threads_batch") || !strcmp(optname, "spec_draft_threads_batch")) {
|
||||
if (optval != NULL) {
|
||||
try {
|
||||
int n = std::stoi(optval_str);
|
||||
if (n <= 0) n = (int)std::thread::hardware_concurrency();
|
||||
params.speculative.draft.cpuparams_batch.n_threads = n;
|
||||
} catch (...) {}
|
||||
}
|
||||
|
||||
// --- draft model MoE on CPU (upstream --spec-draft-cpu-moe / --spec-draft-n-cpu-moe) ---
|
||||
} else if (!strcmp(optname, "draft_cpu_moe") || !strcmp(optname, "spec_draft_cpu_moe")) {
|
||||
// Bool-style flag: optval may be missing, "true"/"1"/"yes" enables.
|
||||
const bool enable = (optval == NULL) ||
|
||||
optval_str == "true" || optval_str == "1" || optval_str == "yes" ||
|
||||
optval_str == "on" || optval_str == "enabled";
|
||||
if (enable) {
|
||||
params.speculative.draft.tensor_buft_overrides.push_back(llm_ffn_exps_cpu_override());
|
||||
}
|
||||
} else if (!strcmp(optname, "draft_n_cpu_moe") || !strcmp(optname, "spec_draft_n_cpu_moe")) {
|
||||
if (optval != NULL) {
|
||||
try {
|
||||
int n = std::stoi(optval_str);
|
||||
if (n < 0) n = 0;
|
||||
// Keep override-name storage alive for the lifetime of the params struct
|
||||
// (mirrors upstream arg.cpp behavior with a function-local static).
|
||||
static std::list<std::string> buft_overrides_draft;
|
||||
for (int i = 0; i < n; ++i) {
|
||||
buft_overrides_draft.push_back(llm_ffn_exps_block_regex(i));
|
||||
params.speculative.draft.tensor_buft_overrides.push_back(
|
||||
{buft_overrides_draft.back().c_str(), ggml_backend_cpu_buffer_type()});
|
||||
}
|
||||
} catch (...) {}
|
||||
}
|
||||
|
||||
// --- draft model tensor buffer overrides (upstream --spec-draft-override-tensor) ---
|
||||
} else if (!strcmp(optname, "draft_override_tensor") || !strcmp(optname, "spec_draft_override_tensor")) {
|
||||
// Format: <tensor regex>=<buffer type>,<tensor regex>=<buffer type>,...
|
||||
// We replicate upstream's parse_tensor_buffer_overrides (static in arg.cpp).
|
||||
ggml_backend_load_all();
|
||||
std::map<std::string, ggml_backend_buffer_type_t> buft_list;
|
||||
for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
|
||||
auto * dev = ggml_backend_dev_get(i);
|
||||
auto * buft = ggml_backend_dev_buffer_type(dev);
|
||||
if (buft) {
|
||||
buft_list[ggml_backend_buft_name(buft)] = buft;
|
||||
}
|
||||
}
|
||||
static std::list<std::string> draft_override_names;
|
||||
std::string cur;
|
||||
auto flush = [&](const std::string & spec) {
|
||||
auto pos = spec.find('=');
|
||||
if (pos == std::string::npos) return;
|
||||
const std::string name = spec.substr(0, pos);
|
||||
const std::string type = spec.substr(pos + 1);
|
||||
auto it = buft_list.find(type);
|
||||
if (it == buft_list.end()) return; // unknown buffer type: ignore
|
||||
draft_override_names.push_back(name);
|
||||
params.speculative.draft.tensor_buft_overrides.push_back(
|
||||
{draft_override_names.back().c_str(), it->second});
|
||||
};
|
||||
for (char c : optval_str) {
|
||||
if (c == ',') { if (!cur.empty()) { flush(cur); cur.clear(); } }
|
||||
else { cur.push_back(c); }
|
||||
}
|
||||
if (!cur.empty()) flush(cur);
|
||||
}
|
||||
#endif // LOCALAI_LEGACY_LLAMA_CPP_SPEC — closes the `else`/`#ifdef` opened at draft_ctx_size
|
||||
}
|
||||
|
||||
// Set params.n_parallel from environment variable if not set via options (fallback)
|
||||
@@ -2704,7 +2888,7 @@ public:
|
||||
|
||||
tasks.reserve(documents.size());
|
||||
for (size_t i = 0; i < documents.size(); i++) {
|
||||
auto tmp = format_prompt_rerank(ctx_server.impl->model, ctx_server.impl->vocab, ctx_server.impl->mctx, request->query(), documents[i]);
|
||||
auto tmp = format_prompt_rerank(ctx_server.impl->model_tgt, ctx_server.impl->vocab, ctx_server.impl->mctx, request->query(), documents[i]);
|
||||
server_task task = server_task(SERVER_TASK_TYPE_RERANK);
|
||||
task.id = rd.queue_tasks.get_new_id();
|
||||
task.index = i;
|
||||
@@ -2882,7 +3066,7 @@ public:
|
||||
// Get template source and reconstruct a common_chat_template for analysis
|
||||
std::string tmpl_src = common_chat_templates_source(ctx_server.impl->chat_params.tmpls.get());
|
||||
if (!tmpl_src.empty()) {
|
||||
const auto * vocab = llama_model_get_vocab(ctx_server.impl->model);
|
||||
const auto * vocab = llama_model_get_vocab(ctx_server.impl->model_tgt);
|
||||
std::string token_bos, token_eos;
|
||||
if (vocab) {
|
||||
auto bos_id = llama_vocab_bos(vocab);
|
||||
|
||||
Reference in New Issue
Block a user