Mirror of https://github.com/mudler/LocalAI.git, synced 2026-05-16 20:52:08 -04:00
* chore(llama.cpp): bump to 1ec7ba0c14f33f17e980daeeda5f35b225d41994
Picks up the upstream `spec : parallel drafting support` change
(ggml-org/llama.cpp#22838) which reshapes the speculative-decoding API
and `server_context_impl`.
Adapt the grpc-server wrapper accordingly:
* `common_params_speculative::type` (single enum) became `types`
(`std::vector<common_speculative_type>`). Update both the
"default to draft when a draft model is set" branch and the
`spec_type`/`speculative_type` option parser. The parser now also
tolerates comma-separated lists, mirroring the upstream
`common_speculative_types_from_names` semantics.
* `common_params_speculative_draft::n_ctx` is gone (draft now shares
the target context size). Keep the `draft_ctx_size` option name for
backward compatibility and ignore the value rather than failing.
* `server_context_impl::model` was renamed to `model_tgt`; update the
two reranker / model-metadata call sites.
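As an illustration of the comma-separated `spec_type` parsing described in the first bullet, here is a minimal standalone sketch. The `spec_type` enum and `parse_spec_types` helper are hypothetical stand-ins for upstream's `common_speculative_type` and `common_speculative_types_from_names`; the real helper lives in llama.cpp's common code and is not reproduced here.

```cpp
// Hedged sketch: split "draft,ngram_mod" into one enum value per token.
// Names below are illustrative, not the upstream identifiers.
#include <sstream>
#include <string>
#include <vector>

enum class spec_type { draft, ngram_mod, ngram_map_k, ngram_map_k4v, ngram_cache };

static std::vector<spec_type> parse_spec_types(const std::string &value) {
    std::vector<spec_type> out;
    std::stringstream ss(value);
    std::string tok;
    while (std::getline(ss, tok, ',')) {
        if (tok == "draft")              out.push_back(spec_type::draft);
        else if (tok == "ngram_mod")     out.push_back(spec_type::ngram_mod);
        else if (tok == "ngram_map_k")   out.push_back(spec_type::ngram_map_k);
        else if (tok == "ngram_map_k4v") out.push_back(spec_type::ngram_map_k4v);
        else if (tok == "ngram_cache")   out.push_back(spec_type::ngram_cache);
        // unknown names are silently skipped in this sketch; the real
        // parser may instead report an error
    }
    return out;
}
```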
Replaces #9763. Builds cleanly under the linux/amd64 cpu-llama-cpp
target locally.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama-cpp): expose new speculative-decoding option keys
Upstream `spec : parallel drafting support` (ggml-org/llama.cpp#22838)
adds the `ngram_mod`, `ngram_map_k`, and `ngram_map_k4v` speculative
families and beefs up the draft-model knobs. The previous bump only
adapted the API; this exposes the new fields through the grpc-server
options dictionary so model configs can drive them.
New `options:` keys (all under `backend: llama-cpp`):
ngram_mod (`ngram_mod` type):
spec_ngram_mod_n_min / spec_ngram_mod_n_max / spec_ngram_mod_n_match
ngram_map_k (`ngram_map_k` type):
spec_ngram_map_k_size_n / spec_ngram_map_k_size_m / spec_ngram_map_k_min_hits
ngram_map_k4v (`ngram_map_k4v` type):
spec_ngram_map_k4v_size_n / spec_ngram_map_k4v_size_m /
spec_ngram_map_k4v_min_hits
ngram lookup caches (`ngram_cache` type):
spec_lookup_cache_static / lookup_cache_static
spec_lookup_cache_dynamic / lookup_cache_dynamic
Draft-model tuning (active when `spec_type` is `draft`):
draft_cache_type_k / spec_draft_cache_type_k
draft_cache_type_v / spec_draft_cache_type_v
draft_threads / spec_draft_threads
draft_threads_batch / spec_draft_threads_batch
draft_cpu_moe / spec_draft_cpu_moe (bool flag)
draft_n_cpu_moe / spec_draft_n_cpu_moe (first N MoE layers on CPU)
draft_override_tensor / spec_draft_override_tensor
(comma-separated <tensor regex>=<buffer type>; re-implements upstream's
static parse_tensor_buffer_overrides since it isn't exported)
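The `draft_override_tensor` splitting can be sketched roughly as below. This is an assumption-laden illustration: `parse_tensor_overrides` is a hypothetical name, and the real re-implementation resolves the right-hand side to a ggml buffer type rather than keeping it as a string.

```cpp
// Hedged sketch: parse "blk\\..*\\.ffn.*=CPU,output=CUDA0" into
// (tensor-regex, buffer-type-name) pairs. Malformed entries are skipped.
#include <sstream>
#include <string>
#include <utility>
#include <vector>

static std::vector<std::pair<std::string, std::string>>
parse_tensor_overrides(const std::string &value) {
    std::vector<std::pair<std::string, std::string>> out;
    std::stringstream ss(value);
    std::string item;
    while (std::getline(ss, item, ',')) {
        const auto eq = item.find('=');
        if (eq == std::string::npos || eq == 0) continue; // no "=", or empty regex
        out.emplace_back(item.substr(0, eq), item.substr(eq + 1));
    }
    return out;
}
```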
`spec_type` already accepted comma-separated lists after the previous
commit, matching upstream's `common_speculative_types_from_names`.
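As a rough illustration (the authoritative per-family tables live in docs/content/advanced/model-configuration.md, and the exact `options:` syntax here is an assumption), a model config driving these keys might look like:

```yaml
# Hypothetical LocalAI model config sketch; key names are from this
# change, but the value syntax should be checked against the docs.
name: my-model
backend: llama-cpp
parameters:
  model: model.gguf
options:
  - spec_type:ngram_mod,draft        # multi-type chaining
  - spec_ngram_mod_n_min:2
  - spec_ngram_mod_n_max:4
  - spec_draft_threads:4
  - spec_draft_cache_type_k:q8_0
```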
Docs: refresh `docs/content/advanced/model-configuration.md` with
per-family tables and a note about multi-type chaining.
Builds locally with `make docker-build-llama-cpp` (linux/amd64
cpu-llama-cpp AVX variant).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(turboquant): bridge new llama.cpp spec API to the legacy fork layout
The previous commits in this series adapted backend/cpp/llama-cpp/grpc-server.cpp
to the post-#22838 (parallel drafting) llama.cpp API. The turboquant build
reuses the same grpc-server.cpp through backend/cpp/turboquant/Makefile,
which copies it into turboquant-<flavor>-build/ and runs patch-grpc-server.sh
on the copy. The fork branched before the API refactor, so it errors out on:
* `ctx_server.impl->model_tgt` (fork still has `model`)
* `params.speculative.{ngram_mod,ngram_map_k,ngram_map_k4v,ngram_cache}.*`
(none of these sub-structs exist in the fork)
* `params.speculative.draft.{cache_type_k/v, cpuparams[, _batch].n_threads,
tensor_buft_overrides}` (fork uses the pre-#22397 flat layout)
* `params.speculative.types` vector / `common_speculative_types_from_names`
(fork has a scalar `type` and only the singular helper)
Approach:
1. backend/cpp/llama-cpp/grpc-server.cpp: introduce a single feature switch
`LOCALAI_LEGACY_LLAMA_CPP_SPEC`. When defined, the two `speculative.type[s]`
discriminations (the "default to draft when a draft model is set" branch
and the `spec_type` / `speculative_type` option parser) fall back to the
singular scalar form, and the entire new-option block (ngram_mod / map_k
/ map_k4v / ngram_cache / draft.{cache_type_*, cpuparams*,
tensor_buft_overrides}) is preprocessed out. The macro is *not* defined
in the source tree — stock llama-cpp builds get the full new API.
2. backend/cpp/turboquant/patch-grpc-server.sh: two new patch steps applied
to the per-flavor build copy at turboquant-<flavor>-build/grpc-server.cpp:
- substitute `ctx_server.impl->model_tgt` -> `ctx_server.impl->model`
- inject `#define LOCALAI_LEGACY_LLAMA_CPP_SPEC 1` before the first
`#include`, so the guarded blocks above drop out for the fork build.
Both patches are idempotent and follow the existing sed/awk pattern in
this script (KV cache types, `get_media_marker`, flat speculative
renames). Stock llama-cpp's `grpc-server.cpp` is never touched.
Drop both legacy patches once the turboquant fork rebases past
ggml-org/llama.cpp#22397 / #22838.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(turboquant): close draft_ctx_size brace inside legacy guard
The previous turboquant fix wrapped the new option-handler blocks in
`#ifndef LOCALAI_LEGACY_LLAMA_CPP_SPEC ... #endif` but placed the guard
in the middle of an `else if` chain — the `} else if` openings of the
new blocks were responsible for closing the previous block's brace.
With the macro defined the new blocks vanish, draft_ctx_size's `{`
loses its closer, the for-loop's `}` is consumed instead, and the
file ends with a stray opening brace — clang reports it as
`function-definition is not allowed here before '{'` on the next
top-level `int main(...)` and `expected '}' at end of input`.
Move the chain split inside the draft_ctx_size branch:

    } else if (... "draft_ctx_size") {
        // ...
    #ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
    }                                                  // legacy: chain ends here
    #else
    } else if (... "spec_ngram_mod_n_min") {           // modern: chain continues
        ...
    } else if (... "draft_override_tensor") {
        ...
    }                                                  // closes last branch
    #endif
    }                                                  // closes for-loop
Brace count is now balanced under both preprocessor branches (verified
with `tr -cd '{' | wc -c` against the patched and unpatched outputs).
Local `make docker-build-turboquant` builds the linux/amd64 cpu-llama-cpp
`turboquant-avx` variant cleanly.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(ci): forward AMDGPU_TARGETS into Dockerfile.turboquant builder-prebuilt
Dockerfile.turboquant's `builder-prebuilt` stage was missing the
`ARG AMDGPU_TARGETS` / `ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}` pair that
`builder-fromsource` already has (and that `Dockerfile.llama-cpp`
mirrors across both stages). When CI uses the prebuilt base image
(quay.io/go-skynet/ci-cache:base-grpc-*, the common path) the build-arg
passed by the workflow never reaches the env inside the compile stage.
backend/cpp/llama-cpp/Makefile:38 (introduced by #9626) errors out on
hipblas builds when AMDGPU_TARGETS is empty, and the turboquant
Makefile reuses backend/cpp/llama-cpp via a sibling build dir, so the
same check fires from turboquant-fallback under BUILD_TYPE=hipblas:
Makefile:38: *** AMDGPU_TARGETS is empty — set it to a comma-separated
list of gfx targets e.g. gfx1100,gfx1101. Stop.
make: *** [Makefile:66: turboquant-fallback] Error 2
The bug is latent on master because the docker layer cache stays warm
across builds — the compile step rarely re-runs from scratch. The
llama.cpp bump in this PR invalidates the cache, so the missing env var
becomes load-bearing and the hipblas turboquant CI job fails.
Mirror the existing pattern from Dockerfile.llama-cpp.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
161 lines
6.8 KiB
Docker
ARG BASE_IMAGE=ubuntu:24.04
# BUILDER_BASE_IMAGE defaults to BASE_IMAGE so the Dockerfile parses even
# when no prebuilt base is supplied. The builder-prebuilt stage is only
# entered when BUILDER_TARGET=builder-prebuilt, so a "wrong" fallback
# content here is harmless — BuildKit prunes the unreferenced builder.
ARG BUILDER_BASE_IMAGE=${BASE_IMAGE}
# BUILDER_TARGET selects which builder stage the final scratch image copies
# package output from. Declared at global scope (before any FROM) so it's
# usable in `FROM ${BUILDER_TARGET}` below. Default keeps local
# `make backends/turboquant` on the from-source path.
ARG BUILDER_TARGET=builder-fromsource
ARG APT_MIRROR=""
ARG APT_PORTS_MIRROR=""

# ============================================================================
# Stage: builder-fromsource — self-contained build path.
# Runs .docker/install-base-deps.sh (apt deps + cmake + protoc + gRPC +
# conditional CUDA/ROCm/Vulkan), copies /opt/grpc to /usr/local, then
# compiles the variant. Used when BUILDER_TARGET=builder-fromsource (the
# default; local `make backends/turboquant`).
#
# The install script is the same one that backend/Dockerfile.base-grpc-builder
# runs, so the result is bit-equivalent to the prebuilt-base path
# (builder-prebuilt below).
# ============================================================================
FROM ${BASE_IMAGE} AS builder-fromsource
ARG BUILD_TYPE
ARG CUDA_MAJOR_VERSION
ARG CUDA_MINOR_VERSION
ARG CMAKE_FROM_SOURCE=false
# CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues
ARG CMAKE_VERSION=3.31.10
ARG GRPC_VERSION=v1.65.0
ARG GRPC_MAKEFLAGS="-j4 -Otarget"
ARG SKIP_DRIVERS=false
ARG TARGETARCH
ARG TARGETVARIANT
ARG GO_VERSION=1.25.4
ARG UBUNTU_VERSION=2404
ARG APT_MIRROR
ARG APT_PORTS_MIRROR
ARG AMDGPU_TARGETS=""
ARG BACKEND=rerankers
# CUDA target archs, e.g. --build-arg CUDA_DOCKER_ARCH='75;86;89;120'
ARG CUDA_DOCKER_ARCH
ARG CMAKE_ARGS

ENV BUILD_TYPE=${BUILD_TYPE} \
    CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION} \
    CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION} \
    CMAKE_FROM_SOURCE=${CMAKE_FROM_SOURCE} \
    CMAKE_VERSION=${CMAKE_VERSION} \
    GRPC_VERSION=${GRPC_VERSION} \
    GRPC_MAKEFLAGS=${GRPC_MAKEFLAGS} \
    SKIP_DRIVERS=${SKIP_DRIVERS} \
    TARGETARCH=${TARGETARCH} \
    UBUNTU_VERSION=${UBUNTU_VERSION} \
    APT_MIRROR=${APT_MIRROR} \
    APT_PORTS_MIRROR=${APT_PORTS_MIRROR} \
    AMDGPU_TARGETS=${AMDGPU_TARGETS} \
    CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH} \
    CMAKE_ARGS=${CMAKE_ARGS} \
    DEBIAN_FRONTEND=noninteractive

# CUDA on PATH (no-op when CUDA isn't installed)
ENV PATH=/usr/local/cuda/bin:${PATH}
# HipBLAS / ROCm on PATH (no-op when ROCm isn't installed)
ENV PATH=/opt/rocm/bin:${PATH}

WORKDIR /build

# Install everything via the shared script — the same one that
# backend/Dockerfile.base-grpc-builder runs, so the prebuilt CI base and
# this from-source path are bit-equivalent.
RUN --mount=type=bind,source=.docker/install-base-deps.sh,target=/usr/local/sbin/install-base-deps \
    --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
    bash /usr/local/sbin/install-base-deps

# Mirror builder-prebuilt: copy gRPC from /opt/grpc to /usr/local so
# CMake's find_package finds it at the canonical prefix the Makefile expects.
RUN cp -a /opt/grpc/. /usr/local/

COPY . /LocalAI

# BuildKit cache mount for ccache. See Dockerfile.llama-cpp (commit 9228e5b4)
# for rationale. turboquant is a llama.cpp fork that reuses
# backend/cpp/llama-cpp source via a thin wrapper Makefile, so MOST TUs
# are content-identical to the upstream llama-cpp build. Sharing a cache
# id with llama-cpp could give cross-fork hits — but for now keep them
# separate so a regression in one doesn't poison the other. Revisit
# sharing after measuring the actual hit rate.
#
# The compile body is shared with builder-prebuilt via .docker/turboquant-compile.sh.
RUN --mount=type=bind,source=.docker/turboquant-compile.sh,target=/usr/local/sbin/compile.sh \
    --mount=type=cache,target=/root/.ccache,id=turboquant-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
    bash /usr/local/sbin/compile.sh

# Copy libraries using a script to handle architecture differences
RUN make -BC /LocalAI/backend/cpp/turboquant package

# ============================================================================
# Stage: builder-prebuilt — uses the pre-built base from
# quay.io/go-skynet/ci-cache:base-grpc-* (built by .github/workflows/base-images.yml).
# That image already has gRPC at /opt/grpc + apt deps + CUDA/ROCm/Vulkan
# pre-installed, so we just copy gRPC to /usr/local and compile. Used when
# BUILDER_TARGET=builder-prebuilt (CI when the matrix entry sets
# builder-base-image).
# ============================================================================
FROM ${BUILDER_BASE_IMAGE} AS builder-prebuilt

ARG BUILD_TYPE
ENV BUILD_TYPE=${BUILD_TYPE}
ARG CUDA_DOCKER_ARCH
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
ARG CMAKE_ARGS
ENV CMAKE_ARGS=${CMAKE_ARGS}
# AMDGPU_TARGETS must be forwarded into the env here too — backend/cpp/llama-cpp/Makefile
# (which the turboquant Makefile reuses via a sibling build dir) errors out when the var
# is empty on a hipblas build, and the prebuilt path is what CI exercises most of the
# time. The builder-fromsource stage above already does this; mirror it here.
ARG AMDGPU_TARGETS
ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}
ARG TARGETARCH
ARG TARGETVARIANT

# The base-grpc-* image installs gRPC to /opt/grpc but doesn't copy it to
# /usr/local. Mirror what the from-source path does so the compile step
# can find gRPC at the canonical prefix the Makefile expects.
RUN cp -a /opt/grpc/. /usr/local/

COPY . /LocalAI

RUN --mount=type=bind,source=.docker/turboquant-compile.sh,target=/usr/local/sbin/compile.sh \
    --mount=type=cache,target=/root/.ccache,id=turboquant-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
    bash /usr/local/sbin/compile.sh

RUN make -BC /LocalAI/backend/cpp/turboquant package

# ============================================================================
# Final stage — copies package output from one of the two builders.
# BUILDER_TARGET selects which one. BuildKit prunes the unreferenced builder.
#
# BuildKit doesn't support variable expansion in `COPY --from=` directly,
# so we resolve the ARG by aliasing the chosen builder to a fixed stage
# name via `FROM ${BUILDER_TARGET} AS builder` and then COPY --from=builder.
# BUILDER_TARGET itself is declared as a global ARG at the top of this
# file (required for use in FROM), so we just re-import it into this
# stage's scope before the FROM directive.
# ============================================================================
FROM ${BUILDER_TARGET} AS builder

FROM scratch

# Copy all available binaries (the build process only creates the appropriate ones for the target architecture)
COPY --from=builder /LocalAI/backend/cpp/turboquant/package/. ./