mirror of
https://github.com/mudler/LocalAI.git
synced 2026-04-17 21:40:07 -04:00
* feat(backend): add turboquant llama.cpp-fork backend
turboquant is a llama.cpp fork (TheTom/llama-cpp-turboquant, branch
feature/turboquant-kv-cache) that adds a TurboQuant KV-cache scheme.
It ships as a first-class backend reusing backend/cpp/llama-cpp sources
via a thin wrapper Makefile: each variant target copies ../llama-cpp
into a sibling build dir and invokes llama-cpp's build-llama-cpp-grpc-server
with LLAMA_REPO/LLAMA_VERSION overridden to point at the fork. No
duplication of grpc-server.cpp — upstream fixes flow through automatically.
Wires up the full matrix (CPU, CUDA 12/13, L4T, L4T-CUDA13, ROCm, SYCL
f32/f16, Vulkan) in backend.yml and the gallery entries in index.yaml,
adds a tests-turboquant-grpc e2e job driven by BACKEND_TEST_CACHE_TYPE_K/V=q8_0
to exercise the KV-cache config path (backend_test.go gains dedicated env
vars wired into ModelOptions.CacheTypeKey/Value — a generic improvement
usable by any llama.cpp-family backend), and registers a nightly auto-bump
PR in bump_deps.yaml tracking feature/turboquant-kv-cache.
scripts/changed-backends.js gets a special-case so edits to
backend/cpp/llama-cpp/ also retrigger the turboquant CI pipeline, since
the wrapper reuses those sources.
* feat(turboquant): carry upstream patches against fork API drift
turboquant branched from llama.cpp before upstream commit 66060008
("server: respect the ignore eos flag", #21203) which added the
`logit_bias_eog` field to `server_context_meta` and a matching
parameter to `server_task::params_from_json_cmpl`. The shared
backend/cpp/llama-cpp/grpc-server.cpp depends on that field, so
building it against the fork unmodified fails.
Cherry-pick that commit as a patch file under
backend/cpp/turboquant/patches/ and apply it to the cloned fork
sources via a new apply-patches.sh hook called from the wrapper
Makefile. Simplifies the build flow too: instead of hopping through
llama-cpp's build-llama-cpp-grpc-server indirection, the wrapper now
drives the copied Makefile directly (clone -> patch -> build).
Drop the corresponding patch whenever the fork catches up with
upstream — the build fails fast if a patch stops applying, which
is the signal to retire it.
* docs: add turboquant backend section + clarify cache_type_k/v
Document the new turboquant (llama.cpp fork with TurboQuant KV-cache)
backend alongside the existing llama-cpp / ik-llama-cpp sections in
features/text-generation.md: when to pick it, how to install it from
the gallery, and a YAML example showing backend: turboquant together
with cache_type_k / cache_type_v.
Also expand the cache_type_k / cache_type_v table rows in
advanced/model-configuration.md to spell out the accepted llama.cpp
quantization values and note that these fields apply to all
llama.cpp-family backends, not just vLLM.
* feat(turboquant): patch ggml-rpc GGML_OP_COUNT assertion
The fork adds new GGML ops bringing GGML_OP_COUNT to 97, but
ggml/include/ggml-rpc.h static-asserts it equals 96, breaking
the GGML_RPC=ON build paths (turboquant-grpc / turboquant-rpc-server).
Carry a one-line patch that updates the expected count so the
assertion holds. Drop this patch whenever the fork fixes it upstream.
* feat(turboquant): allow turbo* KV-cache types and exercise them in e2e
The shared backend/cpp/llama-cpp/grpc-server.cpp carries its own
allow-list of accepted KV-cache types (kv_cache_types[]) and rejects
anything outside it before the value reaches llama.cpp's parser. That
list only contains the standard llama.cpp types — turbo2/turbo3/turbo4
would throw "Unsupported cache type" at LoadModel time, meaning
nothing the LocalAI gRPC layer accepted was actually fork-specific.
Add a build-time augmentation step (patch-grpc-server.sh, called from
the turboquant wrapper Makefile) that inserts GGML_TYPE_TURBO2_0/3_0/4_0
into the allow-list of the *copied* grpc-server.cpp under
turboquant-<flavor>-build/. The original file under backend/cpp/llama-cpp/
is never touched, so the stock llama-cpp build keeps compiling against
vanilla upstream which has no notion of those enum values.
Switch test-extra-backend-turboquant to set
BACKEND_TEST_CACHE_TYPE_K=turbo3 / _V=turbo3 so the e2e gRPC suite
actually runs the fork's TurboQuant KV-cache code paths (turbo3 also
auto-enables flash_attention in the fork). Picking q8_0 here would
only re-test the standard llama.cpp path that the upstream llama-cpp
backend already covers.
Refresh the docs (text-generation.md + model-configuration.md) to
list turbo2/turbo3/turbo4 explicitly and call out that you only get
the TurboQuant code path with this backend + a turbo* cache type.
* fix(turboquant): rewrite patch-grpc-server.sh in awk, not python3
The builder image (ubuntu:24.04 stage-2 in Dockerfile.turboquant)
does not install python3, so the python-based augmentation step
errored with `python3: command not found` at make time. Switch to
awk, which ships in coreutils and is already available everywhere
the rest of the wrapper Makefile runs.
* Apply suggestion from @mudler
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
---------
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
66 lines
1.6 KiB
Bash
Executable File
66 lines
1.6 KiB
Bash
Executable File
#!/bin/bash
|
|
set -ex
|
|
|
|
# Get the absolute current dir where the script is located
|
|
CURDIR=$(dirname "$(realpath $0)")
|
|
|
|
cd /
|
|
|
|
echo "CPU info:"
|
|
grep -e "model\sname" /proc/cpuinfo | head -1
|
|
grep -e "flags" /proc/cpuinfo | head -1
|
|
|
|
BINARY=turboquant-fallback
|
|
|
|
if grep -q -e "\savx\s" /proc/cpuinfo ; then
|
|
echo "CPU: AVX found OK"
|
|
if [ -e $CURDIR/turboquant-avx ]; then
|
|
BINARY=turboquant-avx
|
|
fi
|
|
fi
|
|
|
|
if grep -q -e "\savx2\s" /proc/cpuinfo ; then
|
|
echo "CPU: AVX2 found OK"
|
|
if [ -e $CURDIR/turboquant-avx2 ]; then
|
|
BINARY=turboquant-avx2
|
|
fi
|
|
fi
|
|
|
|
# Check avx 512
|
|
if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
|
|
echo "CPU: AVX512F found OK"
|
|
if [ -e $CURDIR/turboquant-avx512 ]; then
|
|
BINARY=turboquant-avx512
|
|
fi
|
|
fi
|
|
|
|
if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then
|
|
if [ -e $CURDIR/turboquant-grpc ]; then
|
|
BINARY=turboquant-grpc
|
|
fi
|
|
fi
|
|
|
|
# Extend ld library path with the dir where this script is located/lib
|
|
if [ "$(uname)" == "Darwin" ]; then
|
|
export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
|
|
else
|
|
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
|
|
# Tell rocBLAS where to find TensileLibrary data (GPU kernel tuning files)
|
|
if [ -d "$CURDIR/lib/rocblas/library" ]; then
|
|
export ROCBLAS_TENSILE_LIBPATH=$CURDIR/lib/rocblas/library
|
|
fi
|
|
fi
|
|
|
|
# If there is a lib/ld.so, use it
|
|
if [ -f $CURDIR/lib/ld.so ]; then
|
|
echo "Using lib/ld.so"
|
|
echo "Using binary: $BINARY"
|
|
exec $CURDIR/lib/ld.so $CURDIR/$BINARY "$@"
|
|
fi
|
|
|
|
echo "Using binary: $BINARY"
|
|
exec $CURDIR/$BINARY "$@"
|
|
|
|
# We should never reach this point, however just in case we do, run fallback
|
|
exec $CURDIR/turboquant-fallback "$@"
|