feat(backend): add turboquant llama.cpp-fork backend (#9355)
* feat(backend): add turboquant llama.cpp-fork backend
turboquant is a llama.cpp fork (TheTom/llama-cpp-turboquant, branch
feature/turboquant-kv-cache) that adds a TurboQuant KV-cache scheme.
It ships as a first-class backend reusing backend/cpp/llama-cpp sources
via a thin wrapper Makefile: each variant target copies ../llama-cpp
into a sibling build dir and invokes llama-cpp's build-llama-cpp-grpc-server
with LLAMA_REPO/LLAMA_VERSION overridden to point at the fork. No
duplication of grpc-server.cpp — upstream fixes flow through automatically.
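For illustration, each variant target boils down to roughly the following shell steps (a sketch only: the build-dir name is made up here, and the LLAMA_VERSION value assumes the branch is used directly; the next change in this series swaps this indirection for driving the copied Makefile itself):

    # illustrative expansion of one wrapper variant target
    cp -rf ../llama-cpp llama-cpp-turboquant-build
    make -C llama-cpp-turboquant-build build-llama-cpp-grpc-server \
        LLAMA_REPO=https://github.com/TheTom/llama-cpp-turboquant \
        LLAMA_VERSION=feature/turboquant-kv-cache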
Wires up the full matrix (CPU, CUDA 12/13, L4T, L4T-CUDA13, ROCm, SYCL
f32/f16, Vulkan) in backend.yml and the gallery entries in index.yaml,
adds a tests-turboquant-grpc e2e job driven by BACKEND_TEST_CACHE_TYPE_K/V=q8_0
to exercise the KV-cache config path (backend_test.go gains dedicated env
vars wired into ModelOptions.CacheTypeKey/Value — a generic improvement
usable by any llama.cpp-family backend), and registers a nightly auto-bump
PR in bump_deps.yaml tracking feature/turboquant-kv-cache.
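Because those env vars live in the shared test suite rather than anything turboquant-specific, any llama.cpp-family backend can opt into KV-cache quantization in its e2e run, e.g. (target name assumed by analogy with test-extra-backend-ik-llama-cpp):

    # hypothetical manual invocation against the stock llama-cpp backend
    BACKEND_TEST_CACHE_TYPE_K=q8_0 BACKEND_TEST_CACHE_TYPE_V=q8_0 \
        make test-extra-backend-llama-cpp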
scripts/changed-backends.js gets a special-case so edits to
backend/cpp/llama-cpp/ also retrigger the turboquant CI pipeline, since
the wrapper reuses those sources.
* feat(turboquant): carry upstream patches against fork API drift
turboquant branched from llama.cpp before upstream commit 66060008
("server: respect the ignore eos flag", #21203) which added the
`logit_bias_eog` field to `server_context_meta` and a matching
parameter to `server_task::params_from_json_cmpl`. The shared
backend/cpp/llama-cpp/grpc-server.cpp depends on that field, so
building it against the fork unmodified fails.
Cherry-pick that commit as a patch file under
backend/cpp/turboquant/patches/ and apply it to the cloned fork
sources via a new apply-patches.sh hook called from the wrapper
Makefile. Simplifies the build flow too: instead of hopping through
llama-cpp's build-llama-cpp-grpc-server indirection, the wrapper now
drives the copied Makefile directly (clone -> patch -> build).
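Sketch of that flow, assuming the patch files apply cleanly with plain git apply and the fork is cloned into a llama.cpp directory (both assumptions; the real hook is the apply-patches.sh added by this commit):

    #!/bin/bash
    set -e
    # clone -> patch -> build, roughly as the wrapper now drives it
    git clone --branch feature/turboquant-kv-cache \
        https://github.com/TheTom/llama-cpp-turboquant llama.cpp
    for p in backend/cpp/turboquant/patches/*.patch; do
        echo "applying $p"
        git -C llama.cpp apply "$p"   # set -e stops the build on the first failure
    done
    # ...after which the copied Makefile's normal build takes over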
Drop the corresponding patch whenever the fork catches up with
upstream — the build fails fast if a patch stops applying, which
is the signal to retire it.
* docs: add turboquant backend section + clarify cache_type_k/v
Document the new turboquant (llama.cpp fork with TurboQuant KV-cache)
backend alongside the existing llama-cpp / ik-llama-cpp sections in
features/text-generation.md: when to pick it, how to install it from
the gallery, and a YAML example showing backend: turboquant together
with cache_type_k / cache_type_v.
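For reference, the example has roughly this shape (model name and GGUF file are placeholders, and the exact field placement follows advanced/model-configuration.md):

    # illustrative model definition for the gallery-installed turboquant backend
    cat > models/turboquant-example.yaml <<'EOF'
    name: turboquant-example
    backend: turboquant
    parameters:
      model: my-model.Q4_K_M.gguf
    cache_type_k: q8_0
    cache_type_v: q8_0
    EOF

(A later change in this series extends the accepted cache_type values with the fork-specific turbo2/turbo3/turbo4.)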
Also expand the cache_type_k / cache_type_v table rows in
advanced/model-configuration.md to spell out the accepted llama.cpp
quantization values and note that these fields apply to all
llama.cpp-family backends, not just vLLM.
* feat(turboquant): patch ggml-rpc GGML_OP_COUNT assertion
The fork adds new GGML ops, bringing GGML_OP_COUNT to 97, but
ggml/include/ggml-rpc.h static-asserts it equals 96, breaking
the GGML_RPC=ON build paths (turboquant-grpc / turboquant-rpc-server).
Carry a one-line patch that updates the expected count so the
assertion holds. Drop this patch whenever the fork fixes it upstream.
* feat(turboquant): allow turbo* KV-cache types and exercise them in e2e
The shared backend/cpp/llama-cpp/grpc-server.cpp carries its own
allow-list of accepted KV-cache types (kv_cache_types[]) and rejects
anything outside it before the value reaches llama.cpp's parser. That
list only contains the standard llama.cpp types — turbo2/turbo3/turbo4
would throw "Unsupported cache type" at LoadModel time, meaning
nothing the LocalAI gRPC layer accepted was actually fork-specific.
Add a build-time augmentation step (patch-grpc-server.sh, called from
the turboquant wrapper Makefile) that inserts GGML_TYPE_TURBO2_0/3_0/4_0
into the allow-list of the *copied* grpc-server.cpp under
turboquant-<flavor>-build/. The original file under backend/cpp/llama-cpp/
is never touched, so the stock llama-cpp build keeps compiling against
vanilla upstream which has no notion of those enum values.
Switch test-extra-backend-turboquant to set
BACKEND_TEST_CACHE_TYPE_K=turbo3 / _V=turbo3 so the e2e gRPC suite
actually runs the fork's TurboQuant KV-cache code paths (turbo3 also
auto-enables flash_attention in the fork). Picking q8_0 here would
only re-test the standard llama.cpp path that the upstream llama-cpp
backend already covers.
Refresh the docs (text-generation.md + model-configuration.md) to
list turbo2/turbo3/turbo4 explicitly and call out that you only get
the TurboQuant code path with this backend + a turbo* cache type.
* fix(turboquant): rewrite patch-grpc-server.sh in awk, not python3
The builder image (ubuntu:24.04 stage-2 in Dockerfile.turboquant)
does not install python3, so the python-based augmentation step
errored with `python3: command not found` at make time. Switch to
awk, which is already available in the base images and everywhere
else the rest of the wrapper Makefile runs.
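For the record, the augmentation amounts to an awk pass of roughly this shape (the pattern that locates the allow-list and the exact enum spellings are assumptions about the copied grpc-server.cpp, not a verbatim copy of the script):

    # append the fork's cache types right after the kv_cache_types opening brace
    awk '/kv_cache_types/ && /\{/ {
             print
             print "    GGML_TYPE_TURBO2_0,"
             print "    GGML_TYPE_TURBO3_0,"
             print "    GGML_TYPE_TURBO4_0,"
             next
         }
         { print }' grpc-server.cpp > grpc-server.cpp.tmp && \
        mv grpc-server.cpp.tmp grpc-server.cpp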
* Apply suggestion from @mudler
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
---------
Signed-off-by: Ettore Di Giacinto <mudler@users.noreply.github.com>
Commit 95efb8a562 (parent 410d100cc3), committed via GitHub.
.github/workflows/backend.yml (117 changes)
@@ -353,6 +353,19 @@ jobs:
     dockerfile: "./backend/Dockerfile.llama-cpp"
     context: "./"
     ubuntu-version: '2404'
+  - build-type: 'cublas'
+    cuda-major-version: "12"
+    cuda-minor-version: "8"
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-nvidia-cuda-12-turboquant'
+    runs-on: 'bigger-runner'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "turboquant"
+    dockerfile: "./backend/Dockerfile.turboquant"
+    context: "./"
+    ubuntu-version: '2404'
   - build-type: 'cublas'
     cuda-major-version: "12"
     cuda-minor-version: "8"
@@ -796,6 +809,19 @@ jobs:
     dockerfile: "./backend/Dockerfile.llama-cpp"
     context: "./"
     ubuntu-version: '2404'
+  - build-type: 'cublas'
+    cuda-major-version: "13"
+    cuda-minor-version: "0"
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-nvidia-cuda-13-turboquant'
+    runs-on: 'ubuntu-latest'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "turboquant"
+    dockerfile: "./backend/Dockerfile.turboquant"
+    context: "./"
+    ubuntu-version: '2404'
   - build-type: 'cublas'
     cuda-major-version: "13"
     cuda-minor-version: "0"
@@ -809,6 +835,19 @@ jobs:
     backend: "llama-cpp"
     dockerfile: "./backend/Dockerfile.llama-cpp"
     context: "./"
+  - build-type: 'cublas'
+    cuda-major-version: "13"
+    cuda-minor-version: "0"
+    platforms: 'linux/arm64'
+    skip-drivers: 'false'
+    tag-latest: 'auto'
+    tag-suffix: '-nvidia-l4t-cuda-13-arm64-turboquant'
+    base-image: "ubuntu:24.04"
+    runs-on: 'ubuntu-24.04-arm'
+    ubuntu-version: '2404'
+    backend: "turboquant"
+    dockerfile: "./backend/Dockerfile.turboquant"
+    context: "./"
   - build-type: 'cublas'
     cuda-major-version: "13"
     cuda-minor-version: "0"
@@ -1330,6 +1369,19 @@ jobs:
     dockerfile: "./backend/Dockerfile.llama-cpp"
     context: "./"
     ubuntu-version: '2404'
+  - build-type: 'hipblas'
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-rocm-hipblas-turboquant'
+    runs-on: 'ubuntu-latest'
+    base-image: "rocm/dev-ubuntu-24.04:7.2.1"
+    skip-drivers: 'false'
+    backend: "turboquant"
+    dockerfile: "./backend/Dockerfile.turboquant"
+    context: "./"
+    ubuntu-version: '2404'
   - build-type: 'hipblas'
     cuda-major-version: ""
     cuda-minor-version: ""
@@ -1566,6 +1618,19 @@ jobs:
     dockerfile: "./backend/Dockerfile.llama-cpp"
     context: "./"
     ubuntu-version: '2404'
+  - build-type: 'sycl_f32'
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-intel-sycl-f32-turboquant'
+    runs-on: 'ubuntu-latest'
+    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
+    skip-drivers: 'false'
+    backend: "turboquant"
+    dockerfile: "./backend/Dockerfile.turboquant"
+    context: "./"
+    ubuntu-version: '2404'
   - build-type: 'sycl_f16'
     cuda-major-version: ""
     cuda-minor-version: ""
@@ -1579,6 +1644,19 @@ jobs:
     dockerfile: "./backend/Dockerfile.llama-cpp"
     context: "./"
     ubuntu-version: '2404'
+  - build-type: 'sycl_f16'
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-intel-sycl-f16-turboquant'
+    runs-on: 'ubuntu-latest'
+    base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
+    skip-drivers: 'false'
+    backend: "turboquant"
+    dockerfile: "./backend/Dockerfile.turboquant"
+    context: "./"
+    ubuntu-version: '2404'
   - build-type: 'intel'
     cuda-major-version: ""
     cuda-minor-version: ""
@@ -1958,6 +2036,19 @@ jobs:
     dockerfile: "./backend/Dockerfile.llama-cpp"
     context: "./"
     ubuntu-version: '2404'
+  - build-type: ''
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64,linux/arm64'
+    tag-latest: 'auto'
+    tag-suffix: '-cpu-turboquant'
+    runs-on: 'bigger-runner'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "turboquant"
+    dockerfile: "./backend/Dockerfile.turboquant"
+    context: "./"
+    ubuntu-version: '2404'
   - build-type: ''
     cuda-major-version: ""
     cuda-minor-version: ""
@@ -1984,6 +2075,19 @@ jobs:
     dockerfile: "./backend/Dockerfile.llama-cpp"
     context: "./"
     ubuntu-version: '2204'
+  - build-type: 'cublas'
+    cuda-major-version: "12"
+    cuda-minor-version: "0"
+    platforms: 'linux/arm64'
+    skip-drivers: 'false'
+    tag-latest: 'auto'
+    tag-suffix: '-nvidia-l4t-arm64-turboquant'
+    base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
+    runs-on: 'ubuntu-24.04-arm'
+    backend: "turboquant"
+    dockerfile: "./backend/Dockerfile.turboquant"
+    context: "./"
+    ubuntu-version: '2204'
   - build-type: 'vulkan'
     cuda-major-version: ""
     cuda-minor-version: ""
@@ -1997,6 +2101,19 @@ jobs:
     dockerfile: "./backend/Dockerfile.llama-cpp"
     context: "./"
     ubuntu-version: '2404'
+  - build-type: 'vulkan'
+    cuda-major-version: ""
+    cuda-minor-version: ""
+    platforms: 'linux/amd64,linux/arm64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-vulkan-turboquant'
+    runs-on: 'bigger-runner'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "turboquant"
+    dockerfile: "./backend/Dockerfile.turboquant"
+    context: "./"
+    ubuntu-version: '2404'
   # Stablediffusion-ggml
   - build-type: ''
     cuda-major-version: ""
.github/workflows/bump_deps.yaml (4 changes)
@@ -18,6 +18,10 @@ jobs:
     variable: "IK_LLAMA_VERSION"
     branch: "main"
     file: "backend/cpp/ik-llama-cpp/Makefile"
+  - repository: "TheTom/llama-cpp-turboquant"
+    variable: "TURBOQUANT_VERSION"
+    branch: "feature/turboquant-kv-cache"
+    file: "backend/cpp/turboquant/Makefile"
   - repository: "ggml-org/whisper.cpp"
     variable: "WHISPER_CPP_VERSION"
     branch: "master"
.github/workflows/test-extra.yml (24 changes)
@@ -31,6 +31,7 @@ jobs:
       llama-cpp-quantization: ${{ steps.detect.outputs.llama-cpp-quantization }}
       llama-cpp: ${{ steps.detect.outputs.llama-cpp }}
       ik-llama-cpp: ${{ steps.detect.outputs.ik-llama-cpp }}
+      turboquant: ${{ steps.detect.outputs.turboquant }}
       vllm: ${{ steps.detect.outputs.vllm }}
       acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
       qwen3-tts-cpp: ${{ steps.detect.outputs.qwen3-tts-cpp }}
@@ -519,6 +520,29 @@ jobs:
       - name: Build ik-llama-cpp backend image and run gRPC e2e tests
         run: |
           make test-extra-backend-ik-llama-cpp
+  tests-turboquant-grpc:
+    needs: detect-changes
+    if: needs.detect-changes.outputs.turboquant == 'true' || needs.detect-changes.outputs.run-all == 'true'
+    runs-on: ubuntu-latest
+    timeout-minutes: 90
+    steps:
+      - name: Clone
+        uses: actions/checkout@v6
+        with:
+          submodules: true
+      - name: Setup Go
+        uses: actions/setup-go@v5
+        with:
+          go-version: '1.25.4'
+      # Exercises the turboquant (llama.cpp fork) backend with KV-cache
+      # quantization enabled. The convenience target sets
+      # BACKEND_TEST_CACHE_TYPE_K / _V=q8_0, which are plumbed into the
+      # ModelOptions.CacheTypeKey/Value gRPC fields. LoadModel-success +
+      # backend stdout/stderr (captured by the Ginkgo suite) prove the
+      # cache-type config path reaches the fork's KV-cache init.
+      - name: Build turboquant backend image and run gRPC e2e tests
+        run: |
+          make test-extra-backend-turboquant
   # tests-vllm-grpc is currently disabled in CI.
   #
   # The prebuilt vllm CPU wheel is compiled with AVX-512 VNNI/BF16