feat(llama-cpp): bump to 1ec7ba0c, adapt grpc-server, expose new spec-decoding options (#9765)

* chore(llama.cpp): bump to 1ec7ba0c14f33f17e980daeeda5f35b225d41994

Picks up the upstream `spec : parallel drafting support` change
(ggml-org/llama.cpp#22838) which reshapes the speculative-decoding API
and `server_context_impl`.

Adapt the grpc-server wrapper accordingly:

  * `common_params_speculative::type` (single enum) became `types`
    (`std::vector<common_speculative_type>`). Update both the
    "default to draft when a draft model is set" branch and the
    `spec_type`/`speculative_type` option parser. The parser now also
    tolerates comma-separated lists, mirroring the upstream
    `common_speculative_types_from_names` semantics.
  * `common_params_speculative_draft::n_ctx` is gone (draft now shares
    the target context size). Keep the `draft_ctx_size` option name for
    backward compatibility and ignore the value rather than failing.
  * `server_context_impl::model` was renamed to `model_tgt`; update the
    two reranker / model-metadata call sites.

Replaces #9763. Builds cleanly under the linux/amd64 cpu-llama-cpp
target locally.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* feat(llama-cpp): expose new speculative-decoding option keys

Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838)
adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative
families and beefs up the draft-model knobs. The previous bump only
adapted the API; this exposes the new fields through the grpc-server
options dictionary so model configs can drive them.

New `options:` keys (all under `backend: llama-cpp`):

ngram_mod (`ngram_mod` type):
  spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match

ngram_map_k (`ngram_map_k` type):
  spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits

ngram_map_k4v (`ngram_map_k4v` type):
  spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m /
  spec_ngram_map_k4v_min_hits

ngram lookup caches (`ngram_cache` type):
  spec_lookup_cache_static / lookup_cache_static
  spec_lookup_cache_dynamic / lookup_cache_dynamic

Draft-model tuning (active when `spec_type` is `draft`):
  draft_cache_type_k / spec_draft_cache_type_k
  draft_cache_type_v / spec_draft_cache_type_v
  draft_threads / spec_draft_threads
  draft_threads_batch / spec_draft_threads_batch
  draft_cpu_moe / spec_draft_cpu_moe          (bool flag)
  draft_n_cpu_moe / spec_draft_n_cpu_moe      (first N MoE layers on CPU)
  draft_override_tensor / spec_draft_override_tensor
    (comma-separated <tensor regex>=<buffer type>; re-implements upstream's
     static parse_tensor_buffer_overrides since it isn't exported)

`spec_type` already accepted comma-separated lists after the previous
commit, matching upstream's `common_speculative_types_from_names`.

Docs: refresh `docs/content/advanced/model-configuration.md` with
per-family tables and a note about multi-type chaining.

Builds locally with `make docker-build-llama-cpp` (linux/amd64
cpu-llama-cpp AVX variant).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(turboquant): bridge new llama.cpp spec API to the legacy fork layout

The previous commits in this series adapted backend/cpp/llama-cpp/grpc-server.cpp
to the post-#22838 (parallel drafting) llama.cpp API. The turboquant build
reuses the same grpc-server.cpp through backend/cpp/turboquant/Makefile,
which copies it into turboquant-<flavor>-build/ and runs patch-grpc-server.sh
on the copy. The fork branched before the API refactor, so it errors out on:

  * `ctx_server.impl->model_tgt` (fork still has `model`)
  * `params.speculative.{ngram_mod,ngram_map_k,ngram_map_k4v,ngram_cache}.*`
    (none of these sub-structs exist in the fork)
  * `params.speculative.draft.{cache_type_k/v, cpuparams[, _batch].n_threads,
    tensor_buft_overrides}` (fork uses the pre-#22397 flat layout)
  * `params.speculative.types` vector / `common_speculative_types_from_names`
    (fork has a scalar `type` and only the singular helper)

Approach:

1. backend/cpp/llama-cpp/grpc-server.cpp: introduce a single feature switch
   `LOCALAI_LEGACY_LLAMA_CPP_SPEC`. When defined, the two `speculative.type[s]`
   discriminations (the "default to draft when a draft model is set" branch
   and the `spec_type` / `speculative_type` option parser) fall back to the
   singular scalar form, and the entire new-option block (ngram_mod / map_k
   / map_k4v / ngram_cache / draft.{cache_type_*, cpuparams*,
   tensor_buft_overrides}) is preprocessed out. The macro is *not* defined
   in the source tree — stock llama-cpp builds get the full new API.

2. backend/cpp/turboquant/patch-grpc-server.sh: two new patch steps applied
   to the per-flavor build copy at turboquant-<flavor>-build/grpc-server.cpp:
   - substitute `ctx_server.impl->model_tgt` -> `ctx_server.impl->model`
   - inject `#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1` before the first
     `#include`, so the guarded blocks above drop out for the fork build.

   Both patches are idempotent and follow the existing sed/awk pattern in
   this script (KV cache types, `get_media_marker`, flat speculative
   renames). Stock llama-cpp's `grpc-server.cpp` is never touched.

Drop both legacy patches once the turboquant fork rebases past
ggml-org/llama.cpp#22397 / #22838.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(turboquant): close draft_ctx_size brace inside legacy guard

The previous turboquant fix wrapped the new option-handler blocks in
`#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC ... #endif` but placed the guard
in the middle of an `else if` chain — the `} else if` openings of the
new blocks were responsible for closing the previous block's brace.
With the macro defined the new blocks vanish, draft_ctx_size's `{`
loses its closer, the for-loop's `}` is consumed instead, and the
file ends with a stray opening brace — clang reports it as
`function-definition is not allowed here before '{'` on the next
top-level `int main(...)` and `expected '}' at end of input`.

Move the chain split inside the draft_ctx_size branch:

    } else if (... "draft_ctx_size") {
        // ...
#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
    }                                  // legacy: chain ends here
#else
    } else if (... "spec_ngram_mod_n_min") {  // modern: chain continues
        ...
    } else if (... "draft_override_tensor") {
        ...
    }                                  // closes last branch
#endif
    }                                  // closes for-loop

Brace count is now balanced under both preprocessor branches (verified
with `tr -cd '{' | wc -c` against the patched and unpatched outputs).

Local `make docker-build-turboquant` builds the linux/amd64 cpu-llama-cpp
`turboquant-avx` variant cleanly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

* fix(ci): forward AMDGPU_TARGETS into Dockerfile.turboquant builder-prebuilt

Dockerfile.turboquant's `builder-prebuilt` stage was missing the
`ARG AMDGPU_TARGETS` / `ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}` pair that
`builder-fromsource` already has (and that `Dockerfile.llama-cpp`
mirrors across both stages). When CI uses the prebuilt base image
(quay.io/go-skynet/ci-cache:base-grpc-*, the common path) the build-arg
passed by the workflow never reaches the env inside the compile stage.

backend/cpp/llama-cpp/Makefile:38 (introduced by #9626) errors out on
hipblas builds when AMDGPU_TARGETS is empty, and the turboquant
Makefile reuses backend/cpp/llama-cpp via a sibling build dir, so the
same check fires from turboquant-fallback under BUILD_TYPE=hipblas:

  Makefile:38: *** AMDGPU_TARGETS is empty — set it to a comma-separated
  list of gfx targets e.g. gfx1100,gfx1101.  Stop.
  make: *** [Makefile:66: turboquant-fallback] Error 2

The bug is latent on master because the docker layer cache stays warm
across builds — the compile step rarely re-runs from scratch. The
llama.cpp bump in this PR invalidates the cache, so the missing env var
becomes load-bearing and the hipblas turboquant CI job fails.

Mirror the existing pattern from Dockerfile.llama-cpp.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
LocalAI [bot]
2026-05-12 17:22:37 +02:00
committed by GitHub
parent 86a7f6c9fa
commit bc4cd3dd85
5 changed files with 296 additions and 11 deletions

View File

@@ -1,5 +1,5 @@
LLAMA_VERSION?=389ff61d77b5c71cec0cf92fe4e5d01ace80b797
LLAMA_VERSION?=1ec7ba0c14f33f17e980daeeda5f35b225d41994
LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
CMAKE_ARGS?=