fix(buun-llama-cpp): shim cudaMemcpy{To,From}Symbol + WARP_SIZE on fwht128 shuffles

Two more hipblas-only build failures in buun's fattn.cu, fixed under the same patches/ infrastructure: 1. cudaMemcpyToSymbol / cudaMemcpyFromSymbol — buun's Q² calibration + TCQ codebook upload paths call the symbol variants of cudaMemcpy. ggml/src/ggml-cuda/vendors/hip.h aliases every other cudaMemcpy* name (cudaMemcpy, cudaMemcpyAsync, cudaMemcpy2DAsync, …) but the symbol pair was never added. 15+ "use of undeclared identifier" errors across fattn.cu lines 40, 54, 74-76, 94, 100-101, 371, 883, 905, 954, 976, 1449, 1463. Add the two missing aliases alongside the existing memcpy block. 2. __shfl_xor_sync fwht128 calls — same 3-arg omission pattern as the earlier argmax top-K fix. Lines 512 (ggml_cuda_fwht128 intra-warp butterfly) and 536 (fwht128_store_half neighbor fetch) drop the width argument that hip.h:33 requires. Add WARP_SIZE. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
fix(buun-llama-cpp): pass WARP_SIZE to argmax __shfl_xor_sync calls
2026-05-24 16:51:44 -04:00 · 2026-04-24 20:09:36 +00:00 · 2026-04-24 16:29:29 +00:00 · 2026-04-24 13:57:30 +00:00 · 2026-04-24 12:52:54 +00:00 · 2026-04-24 12:52:53 +00:00
20 changed files with 1160 additions and 17 deletions
--- a/.agents/ai-coding-assistants.md
+++ b/.agents/ai-coding-assistants.md
@@ -35,19 +35,33 @@ All contributions must comply with LocalAI's licensing requirements:
 ## Signed-off-by and Developer Certificate of Origin
-**AI agents MUST NOT add `Signed-off-by` tags.** Only humans can legally
+Only humans can certify the Developer Certificate of Origin (DCO). AI
-certify the Developer Certificate of Origin (DCO). The human submitter
+agents MUST NOT invent or guess a human identity for `Signed-off-by` —
-is responsible for:
+doing so forges the DCO certification.
- Reviewing all AI-generated code
+However, when a human operator explicitly directs the AI to commit on
 their behalf, the AI is acting as a typing tool — no different from an
 editor macro or `git commit -s`. In that case the AI SHOULD add
 `Signed-off-by:` using the **configured `user.name` / `user.email`** of
 the current git repository (i.e. the operator's own identity). The
 resulting trailer is the operator's signature; they take responsibility
 for it by reviewing and pushing the commit. The AI MUST NOT use any
 other identity and MUST NOT add its own name to the sign-off.
 When running `git commit`, prefer `git commit --signoff` (or `-s`) so
 the trailer is emitted by git itself from the configured identity,
 rather than hand-writing it in a heredoc — this guarantees the sign-off
 matches whatever identity the operator is currently using.
 The human submitter remains responsible for:
 - Reviewing all AI-generated code before it's pushed or merged
 - Ensuring compliance with licensing requirements
 - Adding their own `Signed-off-by` tag (when the project requires DCO)
  to certify the contribution
 - Taking full responsibility for the contribution
-AI agents MUST NOT add `Co-Authored-By` trailers for themselves either.
+AI agents MUST NOT add `Co-Authored-By` trailers for themselves. A human
-A human reviewer owns the contribution; the AI's involvement is recorded
+reviewer owns the contribution; the AI's involvement is recorded via
-via `Assisted-by` (see below).
+`Assisted-by` (see below).
 ## Attribution
@@ -84,6 +98,12 @@ Assisted-by: Claude:claude-opus-4-7 golangci-lint
 Signed-off-by: Jane Developer <jane@example.com>
 ```
 The `Signed-off-by` line uses Jane's own identity because Jane is the
 submitter operating the AI. If Jane asks Claude to create the commit via
 `git commit -s`, git emits that exact trailer from Jane's configured
 identity — no separate human step is needed beyond Jane reviewing the
 diff before pushing.
 ## Scope and Responsibility
 Using an AI assistant does not reduce the contributor's responsibility.
--- a/.github/workflows/backend.yml
+++ b/.github/workflows/backend.yml
@@ -399,6 +399,19 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2404'
          - build-type: 'cublas'
            cuda-major-version: "12"
            cuda-minor-version: "8"
            platforms: 'linux/amd64'
            tag-latest: 'auto'
            tag-suffix: '-gpu-nvidia-cuda-12-buun-llama-cpp'
            runs-on: 'bigger-runner'
            base-image: "ubuntu:24.04"
            skip-drivers: 'false'
            backend: "buun-llama-cpp"
            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
            context: "./"
            ubuntu-version: '2404'
          - build-type: 'cublas'
            cuda-major-version: "12"
            cuda-minor-version: "8"
@@ -894,6 +907,19 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2404'
          - build-type: 'cublas'
            cuda-major-version: "13"
            cuda-minor-version: "0"
            platforms: 'linux/amd64'
            tag-latest: 'auto'
            tag-suffix: '-gpu-nvidia-cuda-13-buun-llama-cpp'
            runs-on: 'ubuntu-latest'
            base-image: "ubuntu:24.04"
            skip-drivers: 'false'
            backend: "buun-llama-cpp"
            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
            context: "./"
            ubuntu-version: '2404'
          - build-type: 'cublas'
            cuda-major-version: "13"
            cuda-minor-version: "0"
@@ -920,6 +946,19 @@ jobs:
            backend: "turboquant"
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
          - build-type: 'cublas'
            cuda-major-version: "13"
            cuda-minor-version: "0"
            platforms: 'linux/arm64'
            skip-drivers: 'false'
            tag-latest: 'auto'
            tag-suffix: '-nvidia-l4t-cuda-13-arm64-buun-llama-cpp'
            base-image: "ubuntu:24.04"
            runs-on: 'ubuntu-24.04-arm'
            ubuntu-version: '2404'
            backend: "buun-llama-cpp"
            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
            context: "./"
          - build-type: 'cublas'
            cuda-major-version: "13"
            cuda-minor-version: "0"
@@ -1454,6 +1493,19 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2404'
          - build-type: 'hipblas'
            cuda-major-version: ""
            cuda-minor-version: ""
            platforms: 'linux/amd64'
            tag-latest: 'auto'
            tag-suffix: '-gpu-rocm-hipblas-buun-llama-cpp'
            runs-on: 'ubuntu-latest'
            base-image: "rocm/dev-ubuntu-24.04:7.2.1"
            skip-drivers: 'false'
            backend: "buun-llama-cpp"
            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
            context: "./"
            ubuntu-version: '2404'
          - build-type: 'hipblas'
            cuda-major-version: ""
            cuda-minor-version: ""
@@ -1703,6 +1755,19 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2404'
          - build-type: 'sycl_f32'
            cuda-major-version: ""
            cuda-minor-version: ""
            platforms: 'linux/amd64'
            tag-latest: 'auto'
            tag-suffix: '-gpu-intel-sycl-f32-buun-llama-cpp'
            runs-on: 'ubuntu-latest'
            base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
            skip-drivers: 'false'
            backend: "buun-llama-cpp"
            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
            context: "./"
            ubuntu-version: '2404'
          - build-type: 'sycl_f16'
            cuda-major-version: ""
            cuda-minor-version: ""
@@ -1729,6 +1794,19 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2404'
          - build-type: 'sycl_f16'
            cuda-major-version: ""
            cuda-minor-version: ""
            platforms: 'linux/amd64'
            tag-latest: 'auto'
            tag-suffix: '-gpu-intel-sycl-f16-buun-llama-cpp'
            runs-on: 'ubuntu-latest'
            base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
            skip-drivers: 'false'
            backend: "buun-llama-cpp"
            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
            context: "./"
            ubuntu-version: '2404'
          - build-type: 'intel'
            cuda-major-version: ""
            cuda-minor-version: ""
@@ -2134,6 +2212,19 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2404'
          - build-type: ''
            cuda-major-version: ""
            cuda-minor-version: ""
            platforms: 'linux/amd64,linux/arm64'
            tag-latest: 'auto'
            tag-suffix: '-cpu-buun-llama-cpp'
            runs-on: 'bigger-runner'
            base-image: "ubuntu:24.04"
            skip-drivers: 'false'
            backend: "buun-llama-cpp"
            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
            context: "./"
            ubuntu-version: '2404'
          - build-type: ''
            cuda-major-version: ""
            cuda-minor-version: ""
@@ -2173,6 +2264,19 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2204'
          - build-type: 'cublas'
            cuda-major-version: "12"
            cuda-minor-version: "0"
            platforms: 'linux/arm64'
            skip-drivers: 'false'
            tag-latest: 'auto'
            tag-suffix: '-nvidia-l4t-arm64-buun-llama-cpp'
            base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
            runs-on: 'ubuntu-24.04-arm'
            backend: "buun-llama-cpp"
            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
            context: "./"
            ubuntu-version: '2204'
          - build-type: 'vulkan'
            cuda-major-version: ""
            cuda-minor-version: ""
@@ -2199,6 +2303,19 @@ jobs:
            dockerfile: "./backend/Dockerfile.turboquant"
            context: "./"
            ubuntu-version: '2404'
          - build-type: 'vulkan'
            cuda-major-version: ""
            cuda-minor-version: ""
            platforms: 'linux/amd64,linux/arm64'
            tag-latest: 'auto'
            tag-suffix: '-gpu-vulkan-buun-llama-cpp'
            runs-on: 'bigger-runner'
            base-image: "ubuntu:24.04"
            skip-drivers: 'false'
            backend: "buun-llama-cpp"
            dockerfile: "./backend/Dockerfile.buun-llama-cpp"
            context: "./"
            ubuntu-version: '2404'
          # Stablediffusion-ggml
          - build-type: ''
            cuda-major-version: ""
--- a/.github/workflows/test-extra.yml
+++ b/.github/workflows/test-extra.yml
@@ -32,6 +32,7 @@ jobs:
      llama-cpp: ${{ steps.detect.outputs.llama-cpp }}
      ik-llama-cpp: ${{ steps.detect.outputs.ik-llama-cpp }}
      turboquant: ${{ steps.detect.outputs.turboquant }}
      buun-llama-cpp: ${{ steps.detect.outputs['buun-llama-cpp'] }}
      vllm: ${{ steps.detect.outputs.vllm }}
      sglang: ${{ steps.detect.outputs.sglang }}
      acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
@@ -613,6 +614,30 @@ jobs:
      - name: Build turboquant backend image and run gRPC e2e tests
        run: |
          make test-extra-backend-turboquant
  tests-buun-llama-cpp-grpc:
    needs: detect-changes
    if: needs.detect-changes.outputs['buun-llama-cpp'] == 'true' || needs.detect-changes.outputs.run-all == 'true'
    runs-on: ubuntu-latest
    timeout-minutes: 90
    steps:
      - name: Clone
        uses: actions/checkout@v6
        with:
          submodules: true
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.25.4'
      # Exercises the buun-llama-cpp (fork-of-a-fork) backend with the
      # fork-specific TurboQuant/TCQ KV-cache types. BACKEND_TEST_CACHE_TYPE_V
      # is set to turbo3 so the test round-trips through the fork's KV
      # allow-list — picking a stock llama.cpp type would only re-test the
      # shared code path. DFlash speculative decoding is not exercised here
      # because the one known public target/drafter pair (Qwen3.5-27B) is too
      # large for CI.
      - name: Build buun-llama-cpp backend image and run gRPC e2e tests
        run: |
          make test-extra-backend-buun-llama-cpp
  # tests-vllm-grpc is currently disabled in CI.
  #
  # The prebuilt vllm CPU wheel is compiled with AVX-512 VNNI/BF16
--- a/23
+++ b/23
@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad backends/sherpa-onnx
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/buun-llama-cpp backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad backends/sherpa-onnx
 GOCMD=go
 GOTEST=$(GOCMD) test
@@ -545,6 +545,19 @@ test-extra-backend-turboquant: docker-build-turboquant
 	BACKEND_TEST_CACHE_TYPE_V=turbo3 \
 	$(MAKE) test-extra-backend
 ## buun-llama-cpp: exercises the fork-of-a-fork backend (spiritbuun/buun-llama-cpp)
 ## with the *TurboQuant/TCQ-specific* KV-cache types (turbo3 for V). Same rationale
 ## as turboquant above: picking a standard llama.cpp type would only re-test the
 ## shared code path. buun inherits turboquant's turbo2/turbo3/turbo4 and adds
 ## turbo2_tcq / turbo3_tcq on top. DFlash speculative decoding is not exercised
 ## here because no small DFlash drafter model exists (the known public pair is
 ## Qwen3.5-27B, ~54 GB).
 test-extra-backend-buun-llama-cpp: docker-build-buun-llama-cpp
 	BACKEND_IMAGE=local-ai-backend:buun-llama-cpp \
 	BACKEND_TEST_CACHE_TYPE_K=q8_0 \
 	BACKEND_TEST_CACHE_TYPE_V=turbo3 \
 	$(MAKE) test-extra-backend
 ## Audio transcription wrapper for the llama-cpp backend.
 ## Drives the new AudioTranscription / AudioTranscriptionStream RPCs against
 ## ggml-org/Qwen3-ASR-0.6B-GGUF (a small ASR model that requires its mmproj
@@ -949,6 +962,11 @@ BACKEND_IK_LLAMA_CPP = ik-llama-cpp|ik-llama-cpp|.|false|false
 # turboquant is a llama.cpp fork with TurboQuant KV-cache quantization.
 # Reuses backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile.
 BACKEND_TURBOQUANT = turboquant|turboquant|.|false|false
 # buun-llama-cpp is a fork-of-a-fork (spiritbuun/buun-llama-cpp forks
 # TheTom/llama-cpp-turboquant) that adds DFlash block-diffusion speculative
 # decoding and extra TCQ KV-cache variants on top of TurboQuant. Same thin
 # wrapper pattern as turboquant — reuses backend/cpp/llama-cpp grpc-server.
 BACKEND_BUUN_LLAMA_CPP = buun-llama-cpp|buun-llama-cpp|.|false|false
 # Golang backends
 BACKEND_PIPER = piper|golang|.|false|true
@@ -1029,6 +1047,7 @@ endef
 $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
 $(eval $(call generate-docker-build-target,$(BACKEND_BUUN_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_LOCAL_STORE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_HUGGINGFACE)))
@@ -1080,7 +1099,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
 docker-save-%: backend-images
 	docker save local-ai-backend:$* -o backend-images/$*.tar
-docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx
+docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-buun-llama-cpp docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx
 ########################################################
 ### Mock Backend for E2E Tests
--- a/backend/Dockerfile.buun-llama-cpp
+++ b/backend/Dockerfile.buun-llama-cpp
@@ -0,0 +1,290 @@
 ARG BASE_IMAGE=ubuntu:24.04
 ARG GRPC_BASE_IMAGE=${BASE_IMAGE}
 # The grpc target does one thing, it builds and installs GRPC.  This is in it's own layer so that it can be effectively cached by CI.
 # You probably don't need to change anything here, and if you do, make sure that CI is adjusted so that the cache continues to work.
 FROM ${GRPC_BASE_IMAGE} AS grpc
 # This is a bit of a hack, but it's required in order to be able to effectively cache this layer in CI
 ARG GRPC_MAKEFLAGS="-j4 -Otarget"
 ARG GRPC_VERSION=v1.65.0
 ARG CMAKE_FROM_SOURCE=false
 # CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues
 ARG CMAKE_VERSION=3.31.10
 ENV MAKEFLAGS=${GRPC_MAKEFLAGS}
 WORKDIR /build
 RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        ca-certificates \
        build-essential curl libssl-dev \
        git wget && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
 # Install CMake (the version in 22.04 is too old)
 RUN <<EOT bash
    if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
        curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
    else
        apt-get update && \
        apt-get install -y \
            cmake && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/*
    fi
 EOT
 # We install GRPC to a different prefix here so that we can copy in only the build artifacts later
 # saves several hundred MB on the final docker image size vs copying in the entire GRPC source tree
 # and running make install in the target container
 RUN git clone --recurse-submodules --jobs 4 -b ${GRPC_VERSION} --depth 1 --shallow-submodules https://github.com/grpc/grpc && \
    mkdir -p /build/grpc/cmake/build && \
    cd /build/grpc/cmake/build && \
    sed -i "216i\  TESTONLY" "../../third_party/abseil-cpp/absl/container/CMakeLists.txt" && \
    cmake -DgRPC_INSTALL=ON -DgRPC_BUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX:PATH=/opt/grpc ../.. && \
    make && \
    make install && \
    rm -rf /build
 FROM ${BASE_IMAGE} AS builder
 ARG CMAKE_FROM_SOURCE=false
 ARG CMAKE_VERSION=3.31.10
 # We can target specific CUDA ARCHITECTURES like --build-arg CUDA_DOCKER_ARCH='75;86;89;120'
 ARG CUDA_DOCKER_ARCH
 ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
 ARG CMAKE_ARGS
 ENV CMAKE_ARGS=${CMAKE_ARGS}
 ARG BACKEND=rerankers
 ARG BUILD_TYPE
 ENV BUILD_TYPE=${BUILD_TYPE}
 ARG CUDA_MAJOR_VERSION
 ARG CUDA_MINOR_VERSION
 ARG SKIP_DRIVERS=false
 ENV CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION}
 ENV CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION}
 ENV DEBIAN_FRONTEND=noninteractive
 ARG TARGETARCH
 ARG TARGETVARIANT
 ARG GO_VERSION=1.25.4
 ARG UBUNTU_VERSION=2404
 RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        build-essential \
        ccache git \
        ca-certificates \
        make \
        pkg-config libcurl4-openssl-dev \
        curl unzip \
        libssl-dev wget && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
 # Cuda
 ENV PATH=/usr/local/cuda/bin:${PATH}
 # HipBLAS requirements
 ENV PATH=/opt/rocm/bin:${PATH}
 # Vulkan requirements
 RUN <<EOT bash
    if [ "${BUILD_TYPE}" = "vulkan" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
        apt-get update && \
        apt-get install -y  --no-install-recommends \
            software-properties-common pciutils wget gpg-agent && \
        apt-get install -y libglm-dev cmake libxcb-dri3-0 libxcb-present0 libpciaccess0 \
            libpng-dev libxcb-keysyms1-dev libxcb-dri3-dev libx11-dev g++ gcc \
            libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
            git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
            ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
            clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
        if [ "amd64" = "$TARGETARCH" ]; then
            wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
            tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
            rm vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
            mkdir -p /opt/vulkan-sdk && \
            mv 1.4.335.0 /opt/vulkan-sdk/ && \
            cd /opt/vulkan-sdk/1.4.335.0 && \
            ./vulkansdk --no-deps --maxjobs \
                vulkan-loader \
                vulkan-validationlayers \
                vulkan-extensionlayer \
                vulkan-tools \
                shaderc && \
            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/bin/* /usr/bin/ && \
            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/lib/* /usr/lib/x86_64-linux-gnu/ && \
            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/include/* /usr/include/ && \
            cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/share/* /usr/share/ && \
            rm -rf /opt/vulkan-sdk
        fi
        if [ "arm64" = "$TARGETARCH" ]; then
            mkdir vulkan && cd vulkan && \
            curl -L -o vulkan-sdk.tar.xz https://github.com/mudler/vulkan-sdk-arm/releases/download/1.4.335.0/vulkansdk-ubuntu-24.04-arm-1.4.335.0.tar.xz && \
            tar -xvf vulkan-sdk.tar.xz && \
            rm vulkan-sdk.tar.xz && \
            cd 1.4.335.0 && \
            cp -rfv aarch64/bin/* /usr/bin/ && \
            cp -rfv aarch64/lib/* /usr/lib/aarch64-linux-gnu/ && \
            cp -rfv aarch64/include/* /usr/include/ && \
            cp -rfv aarch64/share/* /usr/share/ && \
            cd ../.. && \
            rm -rf vulkan
        fi
        ldconfig && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/*
    fi
 EOT
 # CuBLAS requirements
 RUN <<EOT bash
    if ( [ "${BUILD_TYPE}" = "cublas" ] || [ "${BUILD_TYPE}" = "l4t" ] ) && [ "${SKIP_DRIVERS}" = "false" ]; then
        apt-get update && \
        apt-get install -y  --no-install-recommends \
            software-properties-common pciutils
        if [ "amd64" = "$TARGETARCH" ]; then
            curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/x86_64/cuda-keyring_1.1-1_all.deb
        fi
        if [ "arm64" = "$TARGETARCH" ]; then
            if [ "${CUDA_MAJOR_VERSION}" = "13" ]; then
                curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/sbsa/cuda-keyring_1.1-1_all.deb
            else
                curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/arm64/cuda-keyring_1.1-1_all.deb
            fi
        fi
        dpkg -i cuda-keyring_1.1-1_all.deb && \
        rm -f cuda-keyring_1.1-1_all.deb && \
        apt-get update && \
        apt-get install -y --no-install-recommends \
            cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcusparse-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcusolver-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
        if [ "${CUDA_MAJOR_VERSION}" = "13" ] && [ "arm64" = "$TARGETARCH" ]; then
            apt-get install -y --no-install-recommends \
            libcufile-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libcudnn9-cuda-${CUDA_MAJOR_VERSION} cuda-cupti-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libnvjitlink-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
        fi
        apt-get clean && \
        rm -rf /var/lib/apt/lists/*
    fi
 EOT
 # https://github.com/NVIDIA/Isaac-GR00T/issues/343
 RUN <<EOT bash
    if [ "${BUILD_TYPE}" = "cublas" ] && [ "${TARGETARCH}" = "arm64" ]; then
        wget https://developer.download.nvidia.com/compute/cudss/0.6.0/local_installers/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
        dpkg -i cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
        cp /var/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0/cudss-*-keyring.gpg /usr/share/keyrings/ && \
        apt-get update && apt-get -y install cudss cudss-cuda-${CUDA_MAJOR_VERSION} && \
        wget https://developer.download.nvidia.com/compute/nvpl/25.5/local_installers/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
        dpkg -i nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
        cp /var/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5/nvpl-*-keyring.gpg /usr/share/keyrings/ && \
        apt-get update && apt-get install -y nvpl
    fi
 EOT
 # If we are building with clblas support, we need the libraries for the builds
 RUN if [ "${BUILD_TYPE}" = "clblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
        apt-get update && \
        apt-get install -y --no-install-recommends \
            libclblast-dev && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/* \
    ; fi
 RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
        apt-get update && \
        apt-get install -y --no-install-recommends \
            hipblas-dev \
            rocblas-dev && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/* && \
        # I have no idea why, but the ROCM lib packages don't trigger ldconfig after they install, which results in local-ai and others not being able
        # to locate the libraries. We run ldconfig ourselves to work around this packaging deficiency
        ldconfig && \
        # Log which GPU architectures have rocBLAS kernel support
        echo "rocBLAS library data architectures:" && \
        (ls /opt/rocm*/lib/rocblas/library/Kernels* 2>/dev/null || ls /opt/rocm*/lib64/rocblas/library/Kernels* 2>/dev/null) | grep -oP 'gfx[0-9a-z+-]+' | sort -u || \
        echo "WARNING: No rocBLAS kernel data found" \
    ; fi
 RUN echo "TARGETARCH: $TARGETARCH"
 # We need protoc installed, and the version in 22.04 is too old.  We will create one as part installing the GRPC build below
 # but that will also being in a newer version of absl which stablediffusion cannot compile with.  This version of protoc is only
 # here so that we can generate the grpc code for the stablediffusion build
 RUN <<EOT bash
    if [ "amd64" = "$TARGETARCH" ]; then
        curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-x86_64.zip -o protoc.zip && \
        unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
        rm protoc.zip
    fi
    if [ "arm64" = "$TARGETARCH" ]; then
        curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-aarch_64.zip -o protoc.zip && \
        unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
        rm protoc.zip
    fi
 EOT
 # Install CMake (the version in 22.04 is too old)
 RUN <<EOT bash
    if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
        curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
    else
        apt-get update && \
        apt-get install -y \
            cmake && \
        apt-get clean && \
        rm -rf /var/lib/apt/lists/*
    fi
 EOT
 COPY --from=grpc /opt/grpc /usr/local
 COPY . /LocalAI
 RUN <<'EOT' bash
 set -euxo pipefail
 if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
  CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
  export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
  echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
  rm -rf /LocalAI/backend/cpp/buun-llama-cpp-*-build
 fi
 cd /LocalAI/backend/cpp/buun-llama-cpp
 if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
  make buun-llama-cpp-fallback
  make buun-llama-cpp-grpc
  make buun-llama-cpp-rpc-server
 else
  make buun-llama-cpp-avx
  make buun-llama-cpp-avx2
  make buun-llama-cpp-avx512
  make buun-llama-cpp-fallback
  make buun-llama-cpp-grpc
  make buun-llama-cpp-rpc-server
 fi
 EOT
 # Copy libraries using a script to handle architecture differences
 RUN make -BC /LocalAI/backend/cpp/buun-llama-cpp package
 FROM scratch
 # Copy all available binaries (the build process only creates the appropriate ones for the target architecture)
 COPY --from=builder /LocalAI/backend/cpp/buun-llama-cpp/package/. ./
--- a/backend/cpp/buun-llama-cpp/Makefile
+++ b/backend/cpp/buun-llama-cpp/Makefile
@@ -0,0 +1,85 @@
 # Pinned to the HEAD of master on https://github.com/spiritbuun/buun-llama-cpp.
 # Auto-bumped nightly by .github/workflows/bump_deps.yaml.
 BUUN_LLAMA_VERSION?=22464d0848b87c5d56b52fdf6af2e5da46bf803e
 LLAMA_REPO?=https://github.com/spiritbuun/buun-llama-cpp
 CMAKE_ARGS?=
 BUILD_TYPE?=
 NATIVE?=false
 ONEAPI_VARS?=/opt/intel/oneapi/setvars.sh
 TARGET?=--target grpc-server
 JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1)
 ARCH?=$(shell uname -m)
 CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
 LLAMA_CPP_DIR := $(CURRENT_MAKEFILE_DIR)/../llama-cpp
 GREEN := \033[0;32m
 RESET := \033[0m
 # buun-llama-cpp is a llama.cpp fork-of-a-fork (spiritbuun/buun-llama-cpp forked
 # TheTom/llama-cpp-turboquant, which itself forked ggml-org/llama.cpp). Rather
 # than duplicating grpc-server.cpp / CMakeLists.txt / prepare.sh we reuse the
 # ones in backend/cpp/llama-cpp, and only swap which repo+sha the fetch step
 # pulls. Each flavor target copies ../llama-cpp into a sibling
 # ../buun-llama-cpp-<flavor>-build directory, then invokes llama-cpp's own
 # build-llama-cpp-grpc-server with LLAMA_REPO/LLAMA_VERSION overridden to point
 # at the fork.
 PATCHES_DIR := $(CURRENT_MAKEFILE_DIR)/patches
 # Each flavor target:
 #   1. copies backend/cpp/llama-cpp/ (grpc-server.cpp + prepare.sh + CMakeLists.txt + Makefile)
 #      into a sibling buun-llama-cpp-<flavor>-build directory;
 #   2. clones the buun fork into buun-llama-cpp-<flavor>-build/llama.cpp via the
 #      copy's own `llama.cpp` target, overriding LLAMA_REPO/LLAMA_VERSION;
 #   3. applies patches from backend/cpp/buun-llama-cpp/patches/ to the cloned
 #      fork sources (for backporting upstream commits the fork hasn't pulled);
 #   4. runs the copy's `grpc-server` target, which produces the binary we copy
 #      up as buun-llama-cpp-<flavor>.
 define buun-llama-cpp-build
 	rm -rf $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build
 	cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build
 	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build purge
 	# Augment the copied grpc-server.cpp's KV-cache allow-list with the
 	# fork's turbo2/turbo3/turbo4/turbo2_tcq/turbo3_tcq types and wire up the
 	# DFlash-specific option handlers (tree_budget / draft_topk). We patch the
 	# *copy*, never the original under backend/cpp/llama-cpp/, so the stock
 	# llama-cpp build stays compiling against vanilla upstream.
 	bash $(CURRENT_MAKEFILE_DIR)/patch-grpc-server.sh $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build/grpc-server.cpp
 	$(info $(GREEN)I buun-llama-cpp build info:$(1)$(RESET))
 	LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(BUUN_LLAMA_VERSION) \
 	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build llama.cpp
 	bash $(CURRENT_MAKEFILE_DIR)/apply-patches.sh $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build/llama.cpp $(PATCHES_DIR)
 	CMAKE_ARGS="$(CMAKE_ARGS) $(2)" TARGET="$(3)" \
 	LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(BUUN_LLAMA_VERSION) \
 	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build grpc-server
 	cp -rfv $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build/grpc-server buun-llama-cpp-$(1)
 endef
 buun-llama-cpp-avx2:
 	$(call buun-llama-cpp-build,avx2,-DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on,--target grpc-server)
 buun-llama-cpp-avx512:
 	$(call buun-llama-cpp-build,avx512,-DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on,--target grpc-server)
 buun-llama-cpp-avx:
 	$(call buun-llama-cpp-build,avx,-DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)
 buun-llama-cpp-fallback:
 	$(call buun-llama-cpp-build,fallback,-DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)
 buun-llama-cpp-grpc:
 	$(call buun-llama-cpp-build,grpc,-DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server --target rpc-server)
 buun-llama-cpp-rpc-server: buun-llama-cpp-grpc
 	cp -rf $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-grpc-build/llama.cpp/build/bin/rpc-server buun-llama-cpp-rpc-server
 package:
 	bash package.sh
 purge:
 	rm -rf $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-*-build
 	rm -rf buun-llama-cpp-* package
 clean: purge
--- a/backend/cpp/buun-llama-cpp/apply-patches.sh
+++ b/backend/cpp/buun-llama-cpp/apply-patches.sh
@@ -0,0 +1,50 @@
 #!/bin/bash
 # Apply the buun-llama-cpp patch series to a cloned buun-llama-cpp checkout.
 #
 # buun-llama-cpp is a fork-of-a-fork that branched off upstream llama.cpp
 # before some API changes the shared backend/cpp/llama-cpp/grpc-server.cpp
 # depends on. We carry those upstream commits as patch files under
 # backend/cpp/buun-llama-cpp/patches/ and apply them here so the reused
 # grpc-server source compiles against the fork unmodified.
 #
 # Drop the corresponding patch from patches/ whenever the fork catches up with
 # upstream — the build will fail fast if a patch stops applying, which is the
 # signal to retire it.
 set -euo pipefail
 if [[ $# -ne 2 ]]; then
    echo "usage: $0 <llama.cpp-src-dir> <patches-dir>" >&2
    exit 2
 fi
 SRC_DIR=$1
 PATCHES_DIR=$2
 if [[ ! -d "$SRC_DIR" ]]; then
    echo "source dir does not exist: $SRC_DIR" >&2
    exit 2
 fi
 if [[ ! -d "$PATCHES_DIR" ]]; then
    echo "no patches dir at $PATCHES_DIR, nothing to apply"
    exit 0
 fi
 shopt -s nullglob
 patches=("$PATCHES_DIR"/*.patch)
 shopt -u nullglob
 if [[ ${#patches[@]} -eq 0 ]]; then
    echo "no .patch files in $PATCHES_DIR, nothing to apply"
    exit 0
 fi
 cd "$SRC_DIR"
 for patch in "${patches[@]}"; do
    echo "==> applying $patch"
    git apply --verbose "$patch"
 done
 echo "all buun-llama-cpp patches applied successfully"
--- a/backend/cpp/buun-llama-cpp/package.sh
+++ b/backend/cpp/buun-llama-cpp/package.sh
@@ -0,0 +1,57 @@
 #!/bin/bash
 # Script to copy the appropriate libraries based on architecture
 # This script is used in the final stage of the Dockerfile
 set -e
 CURDIR=$(dirname "$(realpath $0)")
 REPO_ROOT="${CURDIR}/../../.."
 # Create lib directory
 mkdir -p $CURDIR/package/lib
 cp -avrf $CURDIR/buun-llama-cpp-* $CURDIR/package/
 cp -rfv $CURDIR/run.sh $CURDIR/package/
 # Detect architecture and copy appropriate libraries
 if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
    # x86_64 architecture
    echo "Detected x86_64 architecture, copying x86_64 libraries..."
    cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
 elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
    # ARM64 architecture
    echo "Detected ARM64 architecture, copying ARM64 libraries..."
    cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
 else
    echo "Error: Could not detect architecture"
    exit 1
 fi
 # Package GPU libraries based on BUILD_TYPE
 GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
 if [ -f "$GPU_LIB_SCRIPT" ]; then
    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
    package_gpu_libs
 fi
 echo "Packaging completed successfully"
 ls -liah $CURDIR/package/
 ls -liah $CURDIR/package/lib/
--- a/backend/cpp/buun-llama-cpp/patch-grpc-server.sh
+++ b/backend/cpp/buun-llama-cpp/patch-grpc-server.sh
@@ -0,0 +1,162 @@
 #!/bin/bash
 # Patch the shared backend/cpp/llama-cpp/grpc-server.cpp *copy* used by the
 # buun-llama-cpp build to account for three gaps between upstream and the fork:
 #
 #   1. Augment the kv_cache_types[] allow-list so `LoadModel` accepts the
 #      fork-specific `turbo2` / `turbo3` / `turbo4` cache types plus the buun
 #      additions `turbo2_tcq` / `turbo3_tcq`.
 #
 #   2. Wire up buun-exclusive speculative-decoding option handlers
 #      (tree_budget / draft_topk) alongside the existing spec_* handlers.
 #      These reference struct fields (common_params.speculative.tree_budget
 #      and .draft_topk) that only exist in buun's common/common.h — adding
 #      them to the shared backend/cpp/llama-cpp/grpc-server.cpp would break
 #      the stock llama-cpp build, so we inject them only into the buun copy.
 #
 #   3. Replace `get_media_marker()` (added upstream in ggml-org/llama.cpp#21962,
 #      server-side random per-instance marker) with the legacy "<__media__>"
 #      literal. The fork branched before that PR, so server-common.cpp has no
 #      get_media_marker symbol. The fork's mtmd_default_marker() still returns
 #      "<__media__>", and Go-side tooling falls back to that sentinel when the
 #      backend does not expose media_marker, so substituting the literal keeps
 #      behavior identical on the buun path.
 #
 # We patch the *copy* sitting in buun-llama-cpp-<flavor>-build/, never the
 # original under backend/cpp/llama-cpp/, so the stock llama-cpp build keeps
 # compiling against vanilla upstream.
 #
 # Idempotent: skips each insertion if its marker is already present (so re-runs
 # of the same build dir don't double-insert).
 set -euo pipefail
 if [[ $# -ne 1 ]]; then
    echo "usage: $0 <grpc-server.cpp>" >&2
    exit 2
 fi
 SRC=$1
 if [[ ! -f "$SRC" ]]; then
    echo "grpc-server.cpp not found at $SRC" >&2
    exit 2
 fi
 if grep -q 'GGML_TYPE_TURBO2_TCQ' "$SRC"; then
    echo "==> $SRC already has buun cache types, skipping KV allow-list patch"
 else
    echo "==> patching $SRC to allow turbo2/turbo3/turbo4/turbo2_tcq/turbo3_tcq KV-cache types"
    # Insert the five TURBO entries right after the first `    GGML_TYPE_Q5_1,`
    # line (the kv_cache_types[] allow-list). Using awk because the builder
    # image does not ship python3, and GNU sed's multi-line `a\` quoting is
    # awkward.
    awk '
        /^    GGML_TYPE_Q5_1,$/ && !done {
            print
            print "    // buun-llama-cpp fork extras — added by patch-grpc-server.sh"
            print "    GGML_TYPE_TURBO2_0,"
            print "    GGML_TYPE_TURBO3_0,"
            print "    GGML_TYPE_TURBO4_0,"
            print "    GGML_TYPE_TURBO2_TCQ,"
            print "    GGML_TYPE_TURBO3_TCQ,"
            done = 1
            next
        }
        { print }
        END {
            if (!done) {
                print "patch-grpc-server.sh: anchor `    GGML_TYPE_Q5_1,` not found" > "/dev/stderr"
                exit 1
            }
        }
    ' "$SRC" > "$SRC.tmp"
    mv "$SRC.tmp" "$SRC"
    echo "==> KV allow-list patch OK"
 fi
 if grep -q 'optname, "tree_budget"' "$SRC"; then
    echo "==> $SRC already has DFlash option handlers, skipping"
 else
    echo "==> patching $SRC to add tree_budget / draft_topk option handlers"
    # Insert two new `else if` handlers between the inner close-brace of the
    # `spec_p_split` block and the next `} else if (…spec_ngram_size_n…)` line.
    # Upstream writes each `} else if` as a single physical line, so we don't
    # emit an outer `}` ourselves — the existing next line provides both the
    # close of our `draft_topk` block and the open of `spec_ngram_size_n`.
    # Anchor on the exact 3-line body of spec_p_split so we can't drift.
    awk '
        prev2 == "        } else if (!strcmp(optname, \"spec_p_split\")) {" &&
        prev1 ~ /^ +if \(optval != NULL\) \{$/ &&
        $0    ~ /^ +try \{ params\.speculative\.p_split = std::stof\(optval_str\); \} catch \(\.\.\.\) \{\}$/ &&
        !done {
            print                        # print the try-line itself
            getline inner_close          # read "            }" closing the inner if
            print inner_close            # print it — this closes spec_p_split body
            print "        // buun-llama-cpp DFlash options — added by patch-grpc-server.sh"
            print "        } else if (!strcmp(optname, \"tree_budget\")) {"
            print "            if (optval != NULL) {"
            print "                try { params.speculative.tree_budget = std::stoi(optval_str); } catch (...) {}"
            print "            }"
            print "        } else if (!strcmp(optname, \"draft_topk\")) {"
            print "            if (optval != NULL) {"
            print "                try { params.speculative.draft_topk = std::stoi(optval_str); } catch (...) {}"
            print "            }"
            # The next source line (`} else if (…spec_ngram_size_n…) {`) closes
            # our draft_topk block and continues the chain naturally; fall back
            # into the main loop to emit it and everything after.
            done = 1
            prev2 = prev1
            prev1 = inner_close
            next
        }
        { print; prev2 = prev1; prev1 = $0 }
        END {
            if (!done) {
                print "patch-grpc-server.sh: spec_p_split anchor not found" > "/dev/stderr"
                exit 1
            }
        }
    ' "$SRC" > "$SRC.tmp"
    mv "$SRC.tmp" "$SRC"
    echo "==> DFlash option-handler patch OK"
 fi
 if grep -qE 'ctx_server\.get_meta\(\)\.logit_bias_eog|params_base\.sampling\.logit_bias_eog,' "$SRC"; then
    echo "==> patching $SRC to drop the logit_bias_eog arg from params_from_json_cmpl() callsites (buun still uses the pre-refactor 4-arg signature)"
    # Upstream llama.cpp refactored params_from_json_cmpl to take a precomputed
    # logit_bias_eog vector after buun's 2026-04-05 fork-point — simultaneously
    # adding server_context_meta::logit_bias_eog as the supplier. Buun carries
    # neither change: its params_from_json_cmpl is still 4-arg, and internally
    # derives logit_bias_eog from the common_params it's passed. So we just
    # delete the argument line entirely — the remaining 4 args match buun's
    # signature and the resulting behavior matches upstream bit-for-bit
    # (upstream's 5th arg is the same data buun derives internally).
    #
    # Guard is broad so this works whether the line has been run through this
    # block before (leaving params_base.sampling.logit_bias_eog,) or not
    # (leaving the original ctx_server.get_meta().logit_bias_eog,).
    sed -E '/^[[:space:]]+(ctx_server\.get_meta\(\)\.logit_bias_eog|params_base\.sampling\.logit_bias_eog),$/d' "$SRC" > "$SRC.tmp"
    mv "$SRC.tmp" "$SRC"
    echo "==> logit_bias_eog arg drop OK"
 else
    echo "==> $SRC has no logit_bias_eog arg line, skipping"
 fi
 if grep -q 'get_media_marker()' "$SRC"; then
    echo "==> patching $SRC to replace get_media_marker() with legacy \"<__media__>\" literal"
    # Only one call site today (ModelMetadata), but replace all occurrences to
    # stay robust if upstream adds more. Use a temp file to avoid relying on
    # sed -i portability (the builder image uses GNU sed, but keeping this
    # consistent with the awk block above).
    sed 's/get_media_marker()/"<__media__>"/g' "$SRC" > "$SRC.tmp"
    mv "$SRC.tmp" "$SRC"
    echo "==> get_media_marker() substitution OK"
 else
    echo "==> $SRC has no get_media_marker() call, skipping media-marker patch"
 fi
 echo "==> all patches applied"
--- a/backend/cpp/buun-llama-cpp/patches/0001-fattn-atomicAdd-double-shim.patch
+++ b/backend/cpp/buun-llama-cpp/patches/0001-fattn-atomicAdd-double-shim.patch
@@ -0,0 +1,46 @@
 Subject: [PATCH] ggml-cuda/fattn: provide atomicAdd(double*,double) shim for pre-sm_60
 Buun's Q² calibration path in ggml_cuda_turbo_scale_q calls
  atomicAdd(&d_q_channel_sq_fattn[threadIdx.x], (double)(val * val));
 but native double atomicAdd is only available on compute capability 6.0
 and newer. Compiling against a CUDA arch list that includes older
 architectures (LocalAI's CUDA 12 Docker image builds for the full
 published arch range) fails with:
    fattn.cu(812): error: no instance of overloaded function "atomicAdd"
      matches the argument list, argument types are: (double *, double)
 Add the canonical CUDA-programming-guide shim at the top of fattn.cu so
 pre-sm_60 codegen has a definition to call. On sm_60+ the native CUDA
 intrinsic is used and the shim is elided via __CUDA_ARCH__.
 --- a/ggml/src/ggml-cuda/fattn.cu
 +++ b/ggml/src/ggml-cuda/fattn.cu
@@ -7,6 +7,27 @@
 #include <atomic>
 +// Pre-sm_60 double atomicAdd shim. Native double atomicAdd(double*,double)
 +// is only available on CUDA compute capability 6.0+ (see CUDA C Programming
 +// Guide, B.15 Atomic Functions). Buun's Q² calibration path below calls
 +// atomicAdd with a double*; without this definition, nvcc fails to find a
 +// matching overload whenever the compile target list includes pre-sm_60
 +// architectures. The standard CAS loop implementation below matches the
 +// semantics of the native intrinsic.
 +#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 600
 +static __device__ double atomicAdd(double * address, double val) {
 +    unsigned long long int * address_as_ull = (unsigned long long int *)address;
 +    unsigned long long int old = *address_as_ull;
 +    unsigned long long int assumed;
 +    do {
 +        assumed = old;
 +        old = atomicCAS(address_as_ull, assumed,
 +                        __double_as_longlong(val + __longlong_as_double(assumed)));
 +    } while (assumed != old);
 +    return __longlong_as_double(old);
 +}
 +#endif
 +
 // InnerQ: update the fattn-side inverse scale array from host (all devices)
 void turbo_innerq_update_fattn_scales(const float * scale_inv) {
     int cur_device;
--- a/backend/cpp/buun-llama-cpp/patches/0002-argmax-shfl-xor-sync-add-width.patch
+++ b/backend/cpp/buun-llama-cpp/patches/0002-argmax-shfl-xor-sync-add-width.patch
@@ -0,0 +1,32 @@
 Subject: [PATCH] ggml-cuda/argmax: pass WARP_SIZE to the top-K __shfl_xor_sync calls
 Two __shfl_xor_sync calls in the top-K intra-warp merge drop the `width`
 argument and rely on the CUDA default (warpSize). Every other call in
 the same file already passes WARP_SIZE explicitly, and the HIP/ROCm
 compatibility shim at ggml/src/ggml-cuda/vendors/hip.h:33 is a 4-arg
 function-like macro — so the 3-arg form fails to preprocess when
 building with hipcc against ROCm:
    argmax.cu:265: error: too few arguments provided to function-like
      macro invocation
    note: macro '__shfl_xor_sync' defined here:
      #define __shfl_xor_sync(mask, var, laneMask, width) \
              __shfl_xor(var, laneMask, width)
 Align the two call sites with the rest of the file by passing WARP_SIZE
 explicitly. On CUDA the generated code is unchanged (warpSize is the
 default); on HIP it now matches the macro's arity.
 --- a/ggml/src/ggml-cuda/argmax.cu
 +++ b/ggml/src/ggml-cuda/argmax.cu
@@ -262,8 +262,8 @@
     // Each step: lane gets partner's min element, if it beats our min, replace and re-heapify
     for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1) {
         for (int i = 0; i < K; i++) {
 -            float partner_val = __shfl_xor_sync(0xFFFFFFFF, heap_val[i], offset);
 -            int partner_idx = __shfl_xor_sync(0xFFFFFFFF, heap_idx[i], offset);
 +            float partner_val = __shfl_xor_sync(0xFFFFFFFF, heap_val[i], offset, WARP_SIZE);
 +            int partner_idx = __shfl_xor_sync(0xFFFFFFFF, heap_idx[i], offset, WARP_SIZE);
             if (partner_val > heap_val[0]) {
                 heap_val[0] = partner_val;
                 heap_idx[0] = partner_idx;
--- a/backend/cpp/buun-llama-cpp/patches/0003-hip-add-memcpy-symbol-aliases.patch
+++ b/backend/cpp/buun-llama-cpp/patches/0003-hip-add-memcpy-symbol-aliases.patch
@@ -0,0 +1,24 @@
 Subject: [PATCH] ggml-cuda/vendors/hip: alias cudaMemcpy{To,From}Symbol to hip counterparts
 Buun's Q² calibration + TCQ codebook upload paths in fattn.cu use
 cudaMemcpyToSymbol / cudaMemcpyFromSymbol. The HIP-compat header in
 ggml/src/ggml-cuda/vendors/hip.h already aliases the scalar cudaMemcpy
 family (cudaMemcpy, cudaMemcpyAsync, cudaMemcpy2DAsync, …) but is
 missing the symbol variants. Building with hipcc therefore fails with
 15+ "use of undeclared identifier 'cudaMemcpyToSymbol'" errors.
 Add the two missing aliases alongside the existing memcpy block. HIP
 provides hipMemcpy{To,From}Symbol with the same signature as CUDA's
 equivalents, so this is a straight name substitution.
 --- a/ggml/src/ggml-cuda/vendors/hip.h
 +++ b/ggml/src/ggml-cuda/vendors/hip.h
@@ -85,6 +85,8 @@
 #define cudaMemcpyDeviceToDevice hipMemcpyDeviceToDevice
 #define cudaMemcpyDeviceToHost hipMemcpyDeviceToHost
 #define cudaMemcpyHostToDevice hipMemcpyHostToDevice
 +#define cudaMemcpyToSymbol hipMemcpyToSymbol
 +#define cudaMemcpyFromSymbol hipMemcpyFromSymbol
 #define cudaMemcpyKind hipMemcpyKind
 #define cudaMemset hipMemset
 #define cudaMemsetAsync hipMemsetAsync
--- a/backend/cpp/buun-llama-cpp/patches/0004-fattn-fwht128-shfl-xor-sync-add-width.patch
+++ b/backend/cpp/buun-llama-cpp/patches/0004-fattn-fwht128-shfl-xor-sync-add-width.patch
@@ -0,0 +1,36 @@
 Subject: [PATCH] ggml-cuda/fattn: pass WARP_SIZE to fwht128 __shfl_xor_sync calls
 Same issue as the argmax top-K fix: two __shfl_xor_sync call sites in
 the FWHT-128 butterfly kernels (ggml_cuda_fwht128 and fwht128_store_half)
 use the 3-arg CUDA form and omit the `width` argument that the HIP
 function-like macro in vendors/hip.h:33 requires. Hipcc fails with:
    fattn.cu:512: too few arguments provided to function-like macro
      invocation
    note: macro '__shfl_xor_sync' defined here:
      #define __shfl_xor_sync(mask, var, laneMask, width) \
              __shfl_xor(var, laneMask, width)
 Add WARP_SIZE to both calls. CUDA codegen is unchanged (warpSize is the
 default); HIP now matches the macro arity.
 --- a/ggml/src/ggml-cuda/fattn.cu
 +++ b/ggml/src/ggml-cuda/fattn.cu
@@ -509,7 +509,7 @@
     // Intra-warp passes: shuffle xor with stride h, no smem, no sync.
     #pragma unroll
     for (int h = 1; h <= 16; h *= 2) {
 -        const float other = __shfl_xor_sync(0xFFFFFFFF, val, h);
 +        const float other = __shfl_xor_sync(0xFFFFFFFF, val, h, WARP_SIZE);
         val = (tid & h) ? (other - val) : (val + other);
     }
@@ -533,7 +533,7 @@
 static __device__ __forceinline__ void fwht128_store_half(
         float val, half * dst_base) {
     const int tid = threadIdx.x;
 -    const float neighbor = __shfl_xor_sync(0xFFFFFFFF, val, 1);
 +    const float neighbor = __shfl_xor_sync(0xFFFFFFFF, val, 1, WARP_SIZE);
     if ((tid & 1) == 0) {
         const half2 packed = __floats2half2_rn(val, neighbor);
         *((half2 *)(dst_base + tid)) = packed;
--- a/backend/cpp/buun-llama-cpp/run.sh
+++ b/backend/cpp/buun-llama-cpp/run.sh
@@ -0,0 +1,65 @@
 #!/bin/bash
 set -ex
 # Get the absolute current dir where the script is located
 CURDIR=$(dirname "$(realpath $0)")
 cd /
 echo "CPU info:"
 grep -e "model\sname" /proc/cpuinfo | head -1
 grep -e "flags" /proc/cpuinfo | head -1
 BINARY=buun-llama-cpp-fallback
 if grep -q -e "\savx\s" /proc/cpuinfo ; then
 	echo "CPU:    AVX    found OK"
 	if [ -e $CURDIR/buun-llama-cpp-avx ]; then
 		BINARY=buun-llama-cpp-avx
 	fi
 fi
 if grep -q -e "\savx2\s" /proc/cpuinfo ; then
 	echo "CPU:    AVX2   found OK"
 	if [ -e $CURDIR/buun-llama-cpp-avx2 ]; then
 		BINARY=buun-llama-cpp-avx2
 	fi
 fi
 # Check avx 512
 if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
 	echo "CPU:    AVX512F found OK"
 	if [ -e $CURDIR/buun-llama-cpp-avx512 ]; then
 		BINARY=buun-llama-cpp-avx512
 	fi
 fi
 if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then
 	if [ -e $CURDIR/buun-llama-cpp-grpc ]; then
 		BINARY=buun-llama-cpp-grpc
 	fi
 fi
 # Extend ld library path with the dir where this script is located/lib
 if [ "$(uname)" == "Darwin" ]; then
 	export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
 else
 	export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
 	# Tell rocBLAS where to find TensileLibrary data (GPU kernel tuning files)
 	if [ -d "$CURDIR/lib/rocblas/library" ]; then
 		export ROCBLAS_TENSILE_LIBPATH=$CURDIR/lib/rocblas/library
 	fi
 fi
 # If there is a lib/ld.so, use it
 if [ -f $CURDIR/lib/ld.so ]; then
 	echo "Using lib/ld.so"
 	echo "Using binary: $BINARY"
 	exec $CURDIR/lib/ld.so $CURDIR/$BINARY "$@"
 fi
 echo "Using binary: $BINARY"
 exec $CURDIR/$BINARY "$@"
 # We should never reach this point, however just in case we do, run fallback
 exec $CURDIR/buun-llama-cpp-fallback "$@"
--- a/core/config/meta/constants.go
+++ b/core/config/meta/constants.go
@@ -37,6 +37,14 @@ var CacheTypeOptions = []FieldOption{
 	{Value: "q4_1", Label: "Q4_1"},
 	{Value: "q5_0", Label: "Q5_0"},
 	{Value: "q5_1", Label: "Q5_1"},
 	// TurboQuant KV-cache types — accepted by the turboquant and
 	// buun-llama-cpp fork backends; stock llama-cpp will reject them at load.
 	{Value: "turbo2", Label: "Turbo2 (TurboQuant)"},
 	{Value: "turbo3", Label: "Turbo3 (TurboQuant)"},
 	{Value: "turbo4", Label: "Turbo4 (TurboQuant)"},
 	// Trellis-Coded Quantization variants — buun-llama-cpp only.
 	{Value: "turbo2_tcq", Label: "Turbo2 TCQ (buun-llama-cpp)"},
 	{Value: "turbo3_tcq", Label: "Turbo3 TCQ (buun-llama-cpp)"},
 }
 var DiffusersPipelineOptions = []FieldOption{
--- a/core/gallery/importers/llama-cpp.go
+++ b/core/gallery/importers/llama-cpp.go
@@ -34,6 +34,7 @@ func (i *LlamaCPPImporter) AdditionalBackends() []KnownBackendEntry {
 	return []KnownBackendEntry{
 		{Name: "ik-llama-cpp", Modality: "text", Description: "GGUF drop-in replacement for llama-cpp with ik-quants"},
 		{Name: "turboquant", Modality: "text", Description: "GGUF drop-in replacement for llama-cpp with TurboQuant optimizations"},
 		{Name: "buun-llama-cpp", Modality: "text", Description: "GGUF drop-in replacement for llama-cpp with DFlash speculative decoding and TurboQuant/TCQ KV-cache quantization"},
 	}
 }
@@ -127,7 +128,7 @@ func (i *LlamaCPPImporter) Import(details Details) (gallery.ModelConfig, error)
 	backend := "llama-cpp"
 	if b, ok := preferencesMap["backend"].(string); ok {
 		switch b {
-		case "ik-llama-cpp", "turboquant":
+		case "ik-llama-cpp", "turboquant", "buun-llama-cpp":
 			backend = b
 		}
 	}
--- a/core/gallery/importers/llama-cpp_test.go
+++ b/core/gallery/importers/llama-cpp_test.go
@@ -181,6 +181,23 @@ var _ = Describe("LlamaCPPImporter", func() {
 			Expect(modelConfig.Files[0].Filename).To(Equal("my-model.gguf"))
 		})
 		It("swaps the emitted backend to buun-llama-cpp when preferred", func() {
 			preferences := json.RawMessage(`{"backend": "buun-llama-cpp"}`)
 			details := Details{
 				URI:         "https://example.com/my-model.gguf",
 				Preferences: preferences,
 			}
 			modelConfig, err := importer.Import(details)
 			Expect(err).ToNot(HaveOccurred())
 			Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: buun-llama-cpp"), fmt.Sprintf("Model config: %+v", modelConfig))
 			Expect(modelConfig.ConfigFile).NotTo(ContainSubstring("backend: llama-cpp\n"), fmt.Sprintf("Model config: %+v", modelConfig))
 			Expect(modelConfig.ConfigFile).To(ContainSubstring("model: my-model.gguf"), fmt.Sprintf("Model config: %+v", modelConfig))
 			Expect(len(modelConfig.Files)).To(Equal(1))
 			Expect(modelConfig.Files[0].Filename).To(Equal("my-model.gguf"))
 		})
 		It("keeps backend: llama-cpp for unknown backend preferences", func() {
 			// Unknown backend values must not leak into the emitted YAML —
 			// we only honour the two curated drop-in replacements.
@@ -375,7 +392,7 @@ var _ = Describe("LlamaCPPImporter", func() {
 	})
 	Context("AdditionalBackends", func() {
-		It("advertises ik-llama-cpp and turboquant as drop-in replacements", func() {
+		It("advertises ik-llama-cpp, turboquant, and buun-llama-cpp as drop-in replacements", func() {
 			entries := importer.AdditionalBackends()
 			names := make([]string, 0, len(entries))
@@ -384,7 +401,7 @@ var _ = Describe("LlamaCPPImporter", func() {
 				names = append(names, e.Name)
 				byName[e.Name] = e
 			}
-			Expect(names).To(ConsistOf("ik-llama-cpp", "turboquant"))
+			Expect(names).To(ConsistOf("ik-llama-cpp", "turboquant", "buun-llama-cpp"))
 			ik := byName["ik-llama-cpp"]
 			Expect(ik.Modality).To(Equal("text"))
@@ -393,6 +410,10 @@ var _ = Describe("LlamaCPPImporter", func() {
 			tq := byName["turboquant"]
 			Expect(tq.Modality).To(Equal("text"))
 			Expect(tq.Description).NotTo(BeEmpty())
 			bn := byName["buun-llama-cpp"]
 			Expect(bn.Modality).To(Equal("text"))
 			Expect(bn.Description).NotTo(BeEmpty())
 		})
 	})
 })
--- a/docs/content/features/text-generation.md
+++ b/docs/content/features/text-generation.md
@@ -631,6 +631,83 @@ The `cache_type_k` / `cache_type_v` fields map to llama.cpp's `-ctk` / `-ctv` fl
 - [Tracked branch: `feature/turboquant-kv-cache`](https://github.com/TheTom/llama-cpp-turboquant/tree/feature/turboquant-kv-cache)
 ### buun-llama-cpp (DFlash speculative decoding + TurboQuant/TCQ KV-cache)
 [buun-llama-cpp](https://github.com/spiritbuun/buun-llama-cpp) is a fork-of-a-fork: spiritbuun forked `TheTom/llama-cpp-turboquant` (the `turboquant` backend above) and added two independent features on top:
 1. **DFlash** — a block-diffusion speculative decoding scheme that uses a dedicated drafter model (new `DFlashDraftModel` GGUF architecture). On a target/drafter pair it emits a block of tokens per speculation step and can be combined with tree-structured verification ("DDTree") for multi-branch draft expansion.
 2. **TCQ (Trellis-Coded Quantization)** — two additional KV-cache types (`turbo2_tcq`, `turbo3_tcq`) on top of the TurboQuant `turbo2` / `turbo3` / `turbo4` already shipped by the parent fork, delivering 10–44% KL reduction over scalar quantization at 2–3 bits per value.
 Like `turboquant`, this backend shares LocalAI's stock `llama-cpp` gRPC server sources — so any GGUF model that runs on `llama-cpp` also runs on `buun-llama-cpp`. Pick it over `turboquant` specifically when you want DFlash speculative decoding or the newer TCQ KV-cache variants.
 #### Features
 - Drop-in GGUF compatibility with upstream `llama.cpp`.
 - DFlash block-diffusion speculative decoding (CUDA/Metal; no CPU fallback).
 - TurboQuant KV-cache types (`turbo2`, `turbo3`, `turbo4`) inherited from the parent `turboquant` fork, plus buun-exclusive `turbo2_tcq` and `turbo3_tcq` variants.
 - Same feature surface as `llama-cpp`: text generation, embeddings, tool calls, multimodal via mmproj.
 - Available on CPU (AVX/AVX2/AVX512/fallback), NVIDIA CUDA 12/13, AMD ROCm/HIP, Intel SYCL f32/f16, Vulkan, and NVIDIA L4T — but note that DFlash and `turbo*` KV types have no CPU fallback and error at model-load on CPU-only builds.
 #### Setup
 `buun-llama-cpp` ships as a separate container image in the LocalAI backend gallery. Install it like any other backend:
 ```bash
 local-ai backends install buun-llama-cpp
 ```
 Or pick a specific flavor for your hardware (example tags: `cpu-buun-llama-cpp`, `cuda12-buun-llama-cpp`, `cuda13-buun-llama-cpp`, `rocm-buun-llama-cpp`, `intel-sycl-f16-buun-llama-cpp`, `vulkan-buun-llama-cpp`).
 #### YAML configuration — TCQ KV-cache
 To run a model with TurboQuant/TCQ quantized KV-cache, set the backend and pick a `turbo*` cache type:
 ```yaml
 name: my-model
 backend: buun-llama-cpp
 parameters:
  model: file.gguf
 # Accepted values for the two fork-aware backends include the stock llama.cpp
 # types (f16, f32, q8_0, q4_0, q4_1, q5_0, q5_1), the TurboQuant types
 # (turbo2, turbo3, turbo4), and the buun-only TCQ variants (turbo2_tcq,
 # turbo3_tcq). turbo3 / turbo4 / turbo*_tcq auto-enable flash_attention.
 cache_type_k: turbo3
 cache_type_v: turbo3_tcq
 context_size: 8192
 ```
 #### YAML configuration — DFlash speculative decoding
 DFlash requires a **dedicated drafter model** in the new `DFlashDraftModel` GGUF architecture. At time of writing the only known public target/drafter pair is [`z-lab/Qwen3.5-27B`](https://huggingface.co/z-lab/Qwen3.5-27B) + [`z-lab/Qwen3.5-27B-DFlash`](https://huggingface.co/z-lab/Qwen3.5-27B-DFlash).
 ```yaml
 name: qwen3-dflash
 backend: buun-llama-cpp
 parameters:
  # Target model (quantized as usual)
  model: Qwen3.5-27B-Q4_K_M.gguf
 # Drafter model produced by buun's convert_hf_to_gguf.py from the
 # DFlashDraftModel checkpoint. Resolved relative to the models path.
 draft_model: Qwen3.5-27B-DFlash.gguf
 options:
  # Switches the speculative pipeline from the default draft-model mode to
  # DFlash (block-diffusion). Required to activate the DFlash code path.
  - spec_type:dflash
  # Optional tuning:
  # - tree_budget:0      # 0 = flat DFlash; >0 = DDTree verification budget
  # - draft_topk:1       # drafter top-K per position (1 = argmax)
  # - spec_n_max:16      # cap on draft tokens per speculation step
 ```
 Under the hood LocalAI wires `draft_model` through to the grpc-server's `params.speculative.mparams_dft.path`, and `spec_type:dflash` is forwarded through the options passthrough to buun's `common_speculative_type_from_name("dflash")`. The `tree_budget` and `draft_topk` options are buun-exclusive; they reference struct fields that only exist in buun's fork, so they're surfaced on this backend only (passing them to stock `llama-cpp` is a no-op).
 #### Reference
 - [spiritbuun/buun-llama-cpp](https://github.com/spiritbuun/buun-llama-cpp)
 - [TCQ paper / dataset](https://huggingface.co/datasets/spiritbuun/turboquant-tcq-kv-cache) — *"Closing the Gap: Trellis-Coded Quantization for KV Cache at 2-3 Bits"*
 - DFlash target/drafter pair: [`z-lab/Qwen3.5-27B`](https://huggingface.co/z-lab/Qwen3.5-27B) + [`z-lab/Qwen3.5-27B-DFlash`](https://huggingface.co/z-lab/Qwen3.5-27B-DFlash)
 ### vLLM
 [vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference.
--- a/docs/content/reference/compatibility-table.md
+++ b/docs/content/reference/compatibility-table.md
@@ -20,6 +20,7 @@ LocalAI will attempt to automatically load models which are not explicitly confi
 |---------|-------------|------------|------------|-----------|-------------|
 | [llama.cpp](https://github.com/ggerganov/llama.cpp) | LLM inference in C/C++. Supports LLaMA, Mamba, RWKV, Falcon, Starcoder, GPT-2, [and many others](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#description) | GPT, Functions | yes | yes | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
 | [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp) | Hard fork of llama.cpp optimized for CPU/hybrid CPU+GPU with IQK quants, custom quant mixes, and MLA for DeepSeek | GPT | yes | yes | CPU (AVX2+) |
 | [buun-llama-cpp](https://github.com/spiritbuun/buun-llama-cpp) | llama.cpp fork with DFlash block-diffusion speculative decoding and TurboQuant/TCQ KV-cache quantization (2–3 bits per value). Accelerated paths are CUDA/Metal only. | GPT, Functions | yes | yes | CUDA, Metal (CPU fallback for non-turbo/non-DFlash only) |
 | [vLLM](https://github.com/vllm-project/vllm) | Fast LLM serving with PagedAttention | GPT, Functions | no | yes | CPU, CUDA 12, ROCm, Intel |
 | [vLLM Omni](https://github.com/vllm-project/vllm) | Unified multimodal generation (text, image, video, audio) | Multimodal GPT, Functions | no | yes | CUDA 12, ROCm |
 | [transformers](https://github.com/huggingface/transformers) | HuggingFace Transformers framework | GPT, Embeddings, Multimodal | yes | yes* | CPU, CUDA 12/13, ROCm, Intel, Metal |
--- a/scripts/changed-backends.js
+++ b/scripts/changed-backends.js
@@ -32,6 +32,12 @@ function inferBackendPath(item) {
    // via a thin wrapper Makefile. Changes to either dir should retrigger it.
    return `backend/cpp/turboquant/`;
  }
  if (item.dockerfile.endsWith("buun-llama-cpp")) {
    // buun-llama-cpp is a fork-of-a-fork (spiritbuun/buun-llama-cpp forks
    // TheTom/llama-cpp-turboquant) that reuses backend/cpp/llama-cpp sources
    // the same way turboquant does. Changes to either dir retrigger it.
    return `backend/cpp/buun-llama-cpp/`;
  }
  if (item.dockerfile.endsWith("llama-cpp")) {
    return `backend/cpp/llama-cpp/`;
  }
@@ -138,9 +144,10 @@ async function getChangedFiles() {
  // Per-backend boolean outputs
  for (const [backend, pathPrefix] of allBackendPaths) {
    let changed = changedFiles.some(file => file.startsWith(pathPrefix));
-    // turboquant reuses backend/cpp/llama-cpp sources via a thin wrapper;
+    // turboquant and buun-llama-cpp reuse backend/cpp/llama-cpp sources via
-    // changes to either directory should retrigger its pipeline.
+    // thin wrapper Makefiles; changes to that directory should retrigger
-    if (backend === "turboquant" && !changed) {
+    // their pipelines too.
    if ((backend === "turboquant" || backend === "buun-llama-cpp") && !changed) {
      changed = changedFiles.some(file => file.startsWith("backend/cpp/llama-cpp/"));
    }
    fs.appendFileSync(process.env.GITHUB_OUTPUT, `${backend}=${changed ? 'true' : 'false'}\n`);
Author	SHA1	Message	Date
Ettore Di Giacinto	9787bee48b	fix(buun-llama-cpp): shim cudaMemcpy{To,From}Symbol + WARP_SIZE on fwht128 shuffles Two more hipblas-only build failures in buun's fattn.cu, fixed under the same patches/ infrastructure: 1. cudaMemcpyToSymbol / cudaMemcpyFromSymbol — buun's Q² calibration + TCQ codebook upload paths call the symbol variants of cudaMemcpy. ggml/src/ggml-cuda/vendors/hip.h aliases every other cudaMemcpy* name (cudaMemcpy, cudaMemcpyAsync, cudaMemcpy2DAsync, …) but the symbol pair was never added. 15+ "use of undeclared identifier" errors across fattn.cu lines 40, 54, 74-76, 94, 100-101, 371, 883, 905, 954, 976, 1449, 1463. Add the two missing aliases alongside the existing memcpy block. 2. __shfl_xor_sync fwht128 calls — same 3-arg omission pattern as the earlier argmax top-K fix. Lines 512 (ggml_cuda_fwht128 intra-warp butterfly) and 536 (fwht128_store_half neighbor fetch) drop the width argument that hip.h:33 requires. Add WARP_SIZE. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-24 20:09:36 +00:00
Ettore Di Giacinto	42754d33b9	fix(buun-llama-cpp): pass WARP_SIZE to argmax __shfl_xor_sync calls Two call sites in ggml/src/ggml-cuda/argmax.cu (the top-K intra-warp merge added by buun) use the 3-arg CUDA form __shfl_xor_sync(mask, var, laneMask), omitting the optional width parameter. The hipification shim at ggml/src/ggml-cuda/vendors/hip.h:33 is a function-like macro that requires all four arguments, so hipcc fails with: argmax.cu:265: too few arguments provided to function-like macro invocation note: macro '__shfl_xor_sync' defined here: #define __shfl_xor_sync(mask, var, laneMask, width) \ __shfl_xor(var, laneMask, width) Every other call in the same file already passes WARP_SIZE explicitly; aligning these two with that convention fixes the hipblas build without changing CUDA codegen (warpSize is the CUDA default). Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-24 16:29:29 +00:00
Ettore Di Giacinto	7f2b7e4ace	fix(buun-llama-cpp): shim atomicAdd(double,double) for pre-sm_60 CUDA Buun's Q² calibration path in ggml/src/ggml-cuda/fattn.cu calls atomicAdd with a double destination. Native double atomicAdd is only available on CUDA compute capability 6.0 and later — LocalAI's CUDA 12 Docker image builds for the full published arch range (which includes sm_50/sm_52), so nvcc fails with: fattn.cu:812: error: no instance of overloaded function "atomicAdd" matches the argument list, argument types are: (double *, double) Add the canonical CAS-loop shim from the CUDA C Programming Guide (B.15 Atomic Functions) guarded on __CUDA_ARCH__ < 600. On sm_60+ the guard is false and nvcc picks up the native intrinsic as before. Patch file lives under backend/cpp/buun-llama-cpp/patches/ and is applied to the cloned fork tree by apply-patches.sh (the infrastructure already put in place for exactly this class of backport). Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-24 13:57:30 +00:00
Ettore Di Giacinto	6233feb190	ci(buun-llama-cpp): wire backend into test-extra + build matrix Adds the buun-llama-cpp backend to the same CI pipelines that turboquant and sherpa-onnx already use: - scripts/changed-backends.js: path resolution for Dockerfile.buun-llama-cpp, plus fork-of-fork detection (changes under backend/cpp/llama-cpp/ also retrigger the buun pipeline, mirroring how turboquant is handled). - .github/workflows/test-extra.yml: detect-changes output and a new tests-buun-llama-cpp-grpc job that runs make test-extra-backend-buun-llama-cpp (turbo3 V-cache, same rationale as tests-turboquant-grpc). - .github/workflows/backend.yml: 9 matrix entries (CUDA 12/13, L4T CUDA 13 ARM64, ROCm, SYCL f32/f16, CPU, L4T ARM64, Vulkan) paired with each existing turboquant entry so image builds have platform parity. Also updates .agents/ai-coding-assistants.md to clarify that AI agents operating under the human submitter's git identity SHOULD emit Signed-off-by via `git commit -s` (never inventing or guessing another identity) — documents the workflow this PR is using. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-24 12:52:54 +00:00
Ettore Di Giacinto	d6bf3a4969	fix(buun-llama-cpp): drop logit_bias_eog arg from params_from_json_cmpl Previous substitution kept the call as 5 args, but buun predates the upstream refactor that also added the logit_bias_eog parameter to params_from_json_cmpl — buun's signature is still the 4-arg form (const llama_vocab*, const common_params&, int, const json&) and it still derives logit_bias_eog internally from the common_params. Replace the substitution with a line-delete. Guard matches both the original call (ctx_server.get_meta().logit_bias_eog) and the previously substituted form (params_base.sampling.logit_bias_eog) so the script stays safe across re-runs and whatever state the tree was left in. Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-24 12:52:53 +00:00
Ettore Di Giacinto	b27d38a53d	fix(buun-llama-cpp): backport logit_bias_eog field to grpc-server copy LocalAI's shared grpc-server.cpp reaches ctx_server.get_meta().logit_bias_eog twice (the twin params_from_json_cmpl callsites). That accessor was added to server_context_meta upstream after buun's 2026-04-05 fork-point, so compiling against buun errors with 'struct server_context_meta' has no member named 'logit_bias_eog'. Rewrite the call sites — only in the buun grpc-server.cpp copy — to source the vector from params_base.sampling.logit_bias_eog instead. That vector is the underlying data the upstream meta accessor eventually returns (buun still carries common_params_sampling::logit_bias_eog at common.h:280), so the substitution yields identical behavior on both trees. The sed is guarded by a grep for the call site, so this patch is self-disabling once buun rebases past the upstream refactor. Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash] [WebFetch] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-24 12:52:53 +00:00
Ettore Di Giacinto	45756b19dc	test(gallery): extend importer specs to cover buun-llama-cpp Two additions that pair with the new backend: - An Import()-side case that asserts preference buun-llama-cpp produces backend: buun-llama-cpp in the emitted YAML (mirrors the existing ik-llama-cpp and turboquant cases). - AdditionalBackends() spec now asserts all three drop-in replacements are advertised, and verifies buun-llama-cpp's Modality/Description alongside the other two. Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-24 12:52:53 +00:00
Ettore Di Giacinto	cd6079b2f3	feat(backend): add buun-llama-cpp fork (DFlash + TCQ KV-cache) spiritbuun/buun-llama-cpp is a fork of TheTom/llama-cpp-turboquant that adds two independent features on top: DFlash block-diffusion speculative decoding (via a dedicated DFlashDraftModel GGUF arch) and two extra TCQ KV-cache variants (turbo2_tcq, turbo3_tcq) on top of TurboQuant's turbo2/turbo3/turbo4. Follows the turboquant thin-wrapper pattern — reuses backend/cpp/llama-cpp grpc-server sources verbatim, patches only the build copy to extend the KV allow-list and wire up buun-exclusive tree_budget / draft_topk options. DraftModel is already wired end-to-end (proto field 39 → params.speculative), so DFlash activation only needs the existing options passthrough (spec_type:dflash) plus the drafter path in draft_model. CacheTypeOptions now surfaces the five turbo* values so the React UI dropdown shows them — benefits turboquant too (previously users had to type them in YAML manually). Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash] [WebFetch] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-04-24 12:52:53 +00:00