Compare commits

...

8 Commits

Author SHA1 Message Date
Ettore Di Giacinto
9787bee48b fix(buun-llama-cpp): shim cudaMemcpy{To,From}Symbol + WARP_SIZE on fwht128 shuffles
Two more hipblas-only build failures in buun's fattn.cu, fixed under the
same patches/ infrastructure:

1. cudaMemcpyToSymbol / cudaMemcpyFromSymbol — buun's Q² calibration +
   TCQ codebook upload paths call the symbol variants of cudaMemcpy.
   ggml/src/ggml-cuda/vendors/hip.h aliases every other cudaMemcpy*
   name (cudaMemcpy, cudaMemcpyAsync, cudaMemcpy2DAsync, …) but the
   symbol pair was never added. 15+ "use of undeclared identifier"
   errors across fattn.cu lines 40, 54, 74-76, 94, 100-101, 371, 883,
   905, 954, 976, 1449, 1463. Add the two missing aliases alongside
   the existing memcpy block.

2. __shfl_xor_sync fwht128 calls — same 3-arg omission pattern as the
   earlier argmax top-K fix. Lines 512 (ggml_cuda_fwht128 intra-warp
   butterfly) and 536 (fwht128_store_half neighbor fetch) drop the
   width argument that hip.h:33 requires. Add WARP_SIZE.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-24 20:09:36 +00:00
Ettore Di Giacinto
42754d33b9 fix(buun-llama-cpp): pass WARP_SIZE to argmax __shfl_xor_sync calls
Two call sites in ggml/src/ggml-cuda/argmax.cu (the top-K intra-warp
merge added by buun) use the 3-arg CUDA form __shfl_xor_sync(mask, var,
laneMask), omitting the optional width parameter. The hipification shim
at ggml/src/ggml-cuda/vendors/hip.h:33 is a function-like macro that
requires all four arguments, so hipcc fails with:

    argmax.cu:265: too few arguments provided to function-like macro
      invocation
    note: macro '__shfl_xor_sync' defined here:
      #define __shfl_xor_sync(mask, var, laneMask, width) \
              __shfl_xor(var, laneMask, width)

Every other call in the same file already passes WARP_SIZE explicitly;
aligning these two with that convention fixes the hipblas build without
changing CUDA codegen (warpSize is the CUDA default).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-24 16:29:29 +00:00
Ettore Di Giacinto
7f2b7e4ace fix(buun-llama-cpp): shim atomicAdd(double*,double) for pre-sm_60 CUDA
Buun's Q² calibration path in ggml/src/ggml-cuda/fattn.cu calls
atomicAdd with a double* destination. Native double atomicAdd is only
available on CUDA compute capability 6.0 and later — LocalAI's CUDA 12
Docker image builds for the full published arch range (which includes
sm_50/sm_52), so nvcc fails with:

    fattn.cu:812: error: no instance of overloaded function "atomicAdd"
    matches the argument list, argument types are: (double *, double)

Add the canonical CAS-loop shim from the CUDA C Programming Guide
(B.15 Atomic Functions) guarded on __CUDA_ARCH__ < 600. On sm_60+ the
guard is false and nvcc picks up the native intrinsic as before.

Patch file lives under backend/cpp/buun-llama-cpp/patches/ and is
applied to the cloned fork tree by apply-patches.sh (the infrastructure
already put in place for exactly this class of backport).

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-24 13:57:30 +00:00
Ettore Di Giacinto
6233feb190 ci(buun-llama-cpp): wire backend into test-extra + build matrix
Adds the buun-llama-cpp backend to the same CI pipelines that turboquant
and sherpa-onnx already use:

- scripts/changed-backends.js: path resolution for Dockerfile.buun-llama-cpp,
  plus fork-of-fork detection (changes under backend/cpp/llama-cpp/ also
  retrigger the buun pipeline, mirroring how turboquant is handled).
- .github/workflows/test-extra.yml: detect-changes output and a new
  tests-buun-llama-cpp-grpc job that runs make test-extra-backend-buun-llama-cpp
  (turbo3 V-cache, same rationale as tests-turboquant-grpc).
- .github/workflows/backend.yml: 9 matrix entries (CUDA 12/13, L4T CUDA
  13 ARM64, ROCm, SYCL f32/f16, CPU, L4T ARM64, Vulkan) paired with each
  existing turboquant entry so image builds have platform parity.

Also updates .agents/ai-coding-assistants.md to clarify that AI agents
operating under the human submitter's git identity SHOULD emit
Signed-off-by via `git commit -s` (never inventing or guessing another
identity) — documents the workflow this PR is using.

Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-24 12:52:54 +00:00
Ettore Di Giacinto
d6bf3a4969 fix(buun-llama-cpp): drop logit_bias_eog arg from params_from_json_cmpl
Previous substitution kept the call as 5 args, but buun predates the
upstream refactor that also *added* the logit_bias_eog parameter to
params_from_json_cmpl — buun's signature is still the 4-arg form
  (const llama_vocab*, const common_params&, int, const json&)
and it still derives logit_bias_eog internally from the common_params.

Replace the substitution with a line-delete. Guard matches both the
original call (ctx_server.get_meta().logit_bias_eog) and the previously
substituted form (params_base.sampling.logit_bias_eog) so the script
stays safe across re-runs and whatever state the tree was left in.

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-24 12:52:53 +00:00
Ettore Di Giacinto
b27d38a53d fix(buun-llama-cpp): backport logit_bias_eog field to grpc-server copy
LocalAI's shared grpc-server.cpp reaches
ctx_server.get_meta().logit_bias_eog twice (the twin params_from_json_cmpl
callsites). That accessor was added to server_context_meta upstream after
buun's 2026-04-05 fork-point, so compiling against buun errors with
  'struct server_context_meta' has no member named 'logit_bias_eog'.

Rewrite the call sites — only in the buun grpc-server.cpp copy — to source
the vector from params_base.sampling.logit_bias_eog instead. That vector is
the underlying data the upstream meta accessor eventually returns (buun
still carries common_params_sampling::logit_bias_eog at common.h:280), so
the substitution yields identical behavior on both trees.

The sed is guarded by a grep for the call site, so this patch is
self-disabling once buun rebases past the upstream refactor.

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash] [WebFetch]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-24 12:52:53 +00:00
Ettore Di Giacinto
45756b19dc test(gallery): extend importer specs to cover buun-llama-cpp
Two additions that pair with the new backend:
- An Import()-side case that asserts preference buun-llama-cpp produces
  backend: buun-llama-cpp in the emitted YAML (mirrors the existing
  ik-llama-cpp and turboquant cases).
- AdditionalBackends() spec now asserts all three drop-in replacements
  are advertised, and verifies buun-llama-cpp's Modality/Description
  alongside the other two.

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-24 12:52:53 +00:00
Ettore Di Giacinto
cd6079b2f3 feat(backend): add buun-llama-cpp fork (DFlash + TCQ KV-cache)
spiritbuun/buun-llama-cpp is a fork of TheTom/llama-cpp-turboquant that adds
two independent features on top: DFlash block-diffusion speculative decoding
(via a dedicated DFlashDraftModel GGUF arch) and two extra TCQ KV-cache
variants (turbo2_tcq, turbo3_tcq) on top of TurboQuant's turbo2/turbo3/turbo4.

Follows the turboquant thin-wrapper pattern — reuses backend/cpp/llama-cpp
grpc-server sources verbatim, patches only the build copy to extend the KV
allow-list and wire up buun-exclusive tree_budget / draft_topk options.
DraftModel is already wired end-to-end (proto field 39 → params.speculative),
so DFlash activation only needs the existing options passthrough
(spec_type:dflash) plus the drafter path in draft_model.

CacheTypeOptions now surfaces the five turbo* values so the React UI dropdown
shows them — benefits turboquant too (previously users had to type them in
YAML manually).

Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash] [WebFetch]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-04-24 12:52:53 +00:00
20 changed files with 1160 additions and 17 deletions

View File

@@ -35,19 +35,33 @@ All contributions must comply with LocalAI's licensing requirements:
## Signed-off-by and Developer Certificate of Origin
**AI agents MUST NOT add `Signed-off-by` tags.** Only humans can legally
certify the Developer Certificate of Origin (DCO). The human submitter
is responsible for:
Only humans can certify the Developer Certificate of Origin (DCO). AI
agents MUST NOT invent or guess a human identity for `Signed-off-by`
doing so forges the DCO certification.
- Reviewing all AI-generated code
However, when a human operator explicitly directs the AI to commit on
their behalf, the AI is acting as a typing tool — no different from an
editor macro or `git commit -s`. In that case the AI SHOULD add
`Signed-off-by:` using the **configured `user.name` / `user.email`** of
the current git repository (i.e. the operator's own identity). The
resulting trailer is the operator's signature; they take responsibility
for it by reviewing and pushing the commit. The AI MUST NOT use any
other identity and MUST NOT add its own name to the sign-off.
When running `git commit`, prefer `git commit --signoff` (or `-s`) so
the trailer is emitted by git itself from the configured identity,
rather than hand-writing it in a heredoc — this guarantees the sign-off
matches whatever identity the operator is currently using.
The human submitter remains responsible for:
- Reviewing all AI-generated code before it's pushed or merged
- Ensuring compliance with licensing requirements
- Adding their own `Signed-off-by` tag (when the project requires DCO)
to certify the contribution
- Taking full responsibility for the contribution
AI agents MUST NOT add `Co-Authored-By` trailers for themselves either.
A human reviewer owns the contribution; the AI's involvement is recorded
via `Assisted-by` (see below).
AI agents MUST NOT add `Co-Authored-By` trailers for themselves. A human
reviewer owns the contribution; the AI's involvement is recorded via
`Assisted-by` (see below).
## Attribution
@@ -84,6 +98,12 @@ Assisted-by: Claude:claude-opus-4-7 golangci-lint
Signed-off-by: Jane Developer <jane@example.com>
```
The `Signed-off-by` line uses Jane's own identity because Jane is the
submitter operating the AI. If Jane asks Claude to create the commit via
`git commit -s`, git emits that exact trailer from Jane's configured
identity — no separate human step is needed beyond Jane reviewing the
diff before pushing.
## Scope and Responsibility
Using an AI assistant does not reduce the contributor's responsibility.

View File

@@ -399,6 +399,19 @@ jobs:
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-12-buun-llama-cpp'
runs-on: 'bigger-runner'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "buun-llama-cpp"
dockerfile: "./backend/Dockerfile.buun-llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "8"
@@ -894,6 +907,19 @@ jobs:
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-buun-llama-cpp'
runs-on: 'ubuntu-latest'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "buun-llama-cpp"
dockerfile: "./backend/Dockerfile.buun-llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -920,6 +946,19 @@ jobs:
backend: "turboquant"
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-buun-llama-cpp'
base-image: "ubuntu:24.04"
runs-on: 'ubuntu-24.04-arm'
ubuntu-version: '2404'
backend: "buun-llama-cpp"
dockerfile: "./backend/Dockerfile.buun-llama-cpp"
context: "./"
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
@@ -1454,6 +1493,19 @@ jobs:
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-rocm-hipblas-buun-llama-cpp'
runs-on: 'ubuntu-latest'
base-image: "rocm/dev-ubuntu-24.04:7.2.1"
skip-drivers: 'false'
backend: "buun-llama-cpp"
dockerfile: "./backend/Dockerfile.buun-llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: 'hipblas'
cuda-major-version: ""
cuda-minor-version: ""
@@ -1703,6 +1755,19 @@ jobs:
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f32'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f32-buun-llama-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "buun-llama-cpp"
dockerfile: "./backend/Dockerfile.buun-llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f16'
cuda-major-version: ""
cuda-minor-version: ""
@@ -1729,6 +1794,19 @@ jobs:
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: 'sycl_f16'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-intel-sycl-f16-buun-llama-cpp'
runs-on: 'ubuntu-latest'
base-image: "intel/oneapi-basekit:2025.3.0-0-devel-ubuntu24.04"
skip-drivers: 'false'
backend: "buun-llama-cpp"
dockerfile: "./backend/Dockerfile.buun-llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: 'intel'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2134,6 +2212,19 @@ jobs:
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64,linux/arm64'
tag-latest: 'auto'
tag-suffix: '-cpu-buun-llama-cpp'
runs-on: 'bigger-runner'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "buun-llama-cpp"
dockerfile: "./backend/Dockerfile.buun-llama-cpp"
context: "./"
ubuntu-version: '2404'
- build-type: ''
cuda-major-version: ""
cuda-minor-version: ""
@@ -2173,6 +2264,19 @@ jobs:
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2204'
- build-type: 'cublas'
cuda-major-version: "12"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-arm64-buun-llama-cpp'
base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
runs-on: 'ubuntu-24.04-arm'
backend: "buun-llama-cpp"
dockerfile: "./backend/Dockerfile.buun-llama-cpp"
context: "./"
ubuntu-version: '2204'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
@@ -2199,6 +2303,19 @@ jobs:
dockerfile: "./backend/Dockerfile.turboquant"
context: "./"
ubuntu-version: '2404'
- build-type: 'vulkan'
cuda-major-version: ""
cuda-minor-version: ""
platforms: 'linux/amd64,linux/arm64'
tag-latest: 'auto'
tag-suffix: '-gpu-vulkan-buun-llama-cpp'
runs-on: 'bigger-runner'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "buun-llama-cpp"
dockerfile: "./backend/Dockerfile.buun-llama-cpp"
context: "./"
ubuntu-version: '2404'
# Stablediffusion-ggml
- build-type: ''
cuda-major-version: ""

View File

@@ -32,6 +32,7 @@ jobs:
llama-cpp: ${{ steps.detect.outputs.llama-cpp }}
ik-llama-cpp: ${{ steps.detect.outputs.ik-llama-cpp }}
turboquant: ${{ steps.detect.outputs.turboquant }}
buun-llama-cpp: ${{ steps.detect.outputs['buun-llama-cpp'] }}
vllm: ${{ steps.detect.outputs.vllm }}
sglang: ${{ steps.detect.outputs.sglang }}
acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
@@ -613,6 +614,30 @@ jobs:
- name: Build turboquant backend image and run gRPC e2e tests
run: |
make test-extra-backend-turboquant
tests-buun-llama-cpp-grpc:
needs: detect-changes
if: needs.detect-changes.outputs['buun-llama-cpp'] == 'true' || needs.detect-changes.outputs.run-all == 'true'
runs-on: ubuntu-latest
timeout-minutes: 90
steps:
- name: Clone
uses: actions/checkout@v6
with:
submodules: true
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.25.4'
# Exercises the buun-llama-cpp (fork-of-a-fork) backend with the
# fork-specific TurboQuant/TCQ KV-cache types. BACKEND_TEST_CACHE_TYPE_V
# is set to turbo3 so the test round-trips through the fork's KV
# allow-list — picking a stock llama.cpp type would only re-test the
# shared code path. DFlash speculative decoding is not exercised here
# because the one known public target/drafter pair (Qwen3.5-27B) is too
# large for CI.
- name: Build buun-llama-cpp backend image and run gRPC e2e tests
run: |
make test-extra-backend-buun-llama-cpp
# tests-vllm-grpc is currently disabled in CI.
#
# The prebuilt vllm CPU wheel is compiled with AVX-512 VNNI/BF16

View File

@@ -1,5 +1,5 @@
# Disable parallel execution for backend builds
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad backends/sherpa-onnx
.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/buun-llama-cpp backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/tinygrad backends/sherpa-onnx
GOCMD=go
GOTEST=$(GOCMD) test
@@ -545,6 +545,19 @@ test-extra-backend-turboquant: docker-build-turboquant
BACKEND_TEST_CACHE_TYPE_V=turbo3 \
$(MAKE) test-extra-backend
## buun-llama-cpp: exercises the fork-of-a-fork backend (spiritbuun/buun-llama-cpp)
## with the *TurboQuant/TCQ-specific* KV-cache types (turbo3 for V). Same rationale
## as turboquant above: picking a standard llama.cpp type would only re-test the
## shared code path. buun inherits turboquant's turbo2/turbo3/turbo4 and adds
## turbo2_tcq / turbo3_tcq on top. DFlash speculative decoding is not exercised
## here because no small DFlash drafter model exists (the known public pair is
## Qwen3.5-27B, ~54 GB).
test-extra-backend-buun-llama-cpp: docker-build-buun-llama-cpp
BACKEND_IMAGE=local-ai-backend:buun-llama-cpp \
BACKEND_TEST_CACHE_TYPE_K=q8_0 \
BACKEND_TEST_CACHE_TYPE_V=turbo3 \
$(MAKE) test-extra-backend
## Audio transcription wrapper for the llama-cpp backend.
## Drives the new AudioTranscription / AudioTranscriptionStream RPCs against
## ggml-org/Qwen3-ASR-0.6B-GGUF (a small ASR model that requires its mmproj
@@ -949,6 +962,11 @@ BACKEND_IK_LLAMA_CPP = ik-llama-cpp|ik-llama-cpp|.|false|false
# turboquant is a llama.cpp fork with TurboQuant KV-cache quantization.
# Reuses backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile.
BACKEND_TURBOQUANT = turboquant|turboquant|.|false|false
# buun-llama-cpp is a fork-of-a-fork (spiritbuun/buun-llama-cpp forks
# TheTom/llama-cpp-turboquant) that adds DFlash block-diffusion speculative
# decoding and extra TCQ KV-cache variants on top of TurboQuant. Same thin
# wrapper pattern as turboquant — reuses backend/cpp/llama-cpp grpc-server.
BACKEND_BUUN_LLAMA_CPP = buun-llama-cpp|buun-llama-cpp|.|false|false
# Golang backends
BACKEND_PIPER = piper|golang|.|false|true
@@ -1029,6 +1047,7 @@ endef
$(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
$(eval $(call generate-docker-build-target,$(BACKEND_BUUN_LLAMA_CPP)))
$(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
$(eval $(call generate-docker-build-target,$(BACKEND_LOCAL_STORE)))
$(eval $(call generate-docker-build-target,$(BACKEND_HUGGINGFACE)))
@@ -1080,7 +1099,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
docker-save-%: backend-images
docker save local-ai-backend:$* -o backend-images/$*.tar
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx
docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-buun-llama-cpp docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx
########################################################
### Mock Backend for E2E Tests

View File

@@ -0,0 +1,290 @@
ARG BASE_IMAGE=ubuntu:24.04
ARG GRPC_BASE_IMAGE=${BASE_IMAGE}
# The grpc target does one thing, it builds and installs GRPC. This is in it's own layer so that it can be effectively cached by CI.
# You probably don't need to change anything here, and if you do, make sure that CI is adjusted so that the cache continues to work.
FROM ${GRPC_BASE_IMAGE} AS grpc
# This is a bit of a hack, but it's required in order to be able to effectively cache this layer in CI
ARG GRPC_MAKEFLAGS="-j4 -Otarget"
ARG GRPC_VERSION=v1.65.0
ARG CMAKE_FROM_SOURCE=false
# CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues
ARG CMAKE_VERSION=3.31.10
ENV MAKEFLAGS=${GRPC_MAKEFLAGS}
WORKDIR /build
RUN apt-get update && \
apt-get install -y --no-install-recommends \
ca-certificates \
build-essential curl libssl-dev \
git wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Install CMake (the version in 22.04 is too old)
RUN <<EOT bash
if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
else
apt-get update && \
apt-get install -y \
cmake && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
# We install GRPC to a different prefix here so that we can copy in only the build artifacts later
# saves several hundred MB on the final docker image size vs copying in the entire GRPC source tree
# and running make install in the target container
RUN git clone --recurse-submodules --jobs 4 -b ${GRPC_VERSION} --depth 1 --shallow-submodules https://github.com/grpc/grpc && \
mkdir -p /build/grpc/cmake/build && \
cd /build/grpc/cmake/build && \
sed -i "216i\ TESTONLY" "../../third_party/abseil-cpp/absl/container/CMakeLists.txt" && \
cmake -DgRPC_INSTALL=ON -DgRPC_BUILD_TESTS=OFF -DCMAKE_INSTALL_PREFIX:PATH=/opt/grpc ../.. && \
make && \
make install && \
rm -rf /build
FROM ${BASE_IMAGE} AS builder
ARG CMAKE_FROM_SOURCE=false
ARG CMAKE_VERSION=3.31.10
# We can target specific CUDA ARCHITECTURES like --build-arg CUDA_DOCKER_ARCH='75;86;89;120'
ARG CUDA_DOCKER_ARCH
ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
ARG CMAKE_ARGS
ENV CMAKE_ARGS=${CMAKE_ARGS}
ARG BACKEND=rerankers
ARG BUILD_TYPE
ENV BUILD_TYPE=${BUILD_TYPE}
ARG CUDA_MAJOR_VERSION
ARG CUDA_MINOR_VERSION
ARG SKIP_DRIVERS=false
ENV CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION}
ENV CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION}
ENV DEBIAN_FRONTEND=noninteractive
ARG TARGETARCH
ARG TARGETVARIANT
ARG GO_VERSION=1.25.4
ARG UBUNTU_VERSION=2404
RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
ccache git \
ca-certificates \
make \
pkg-config libcurl4-openssl-dev \
curl unzip \
libssl-dev wget && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Cuda
ENV PATH=/usr/local/cuda/bin:${PATH}
# HipBLAS requirements
ENV PATH=/opt/rocm/bin:${PATH}
# Vulkan requirements
RUN <<EOT bash
if [ "${BUILD_TYPE}" = "vulkan" ] && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common pciutils wget gpg-agent && \
apt-get install -y libglm-dev cmake libxcb-dri3-0 libxcb-present0 libpciaccess0 \
libpng-dev libxcb-keysyms1-dev libxcb-dri3-dev libx11-dev g++ gcc \
libwayland-dev libxrandr-dev libxcb-randr0-dev libxcb-ewmh-dev \
git python-is-python3 bison libx11-xcb-dev liblz4-dev libzstd-dev \
ocaml-core ninja-build pkg-config libxml2-dev wayland-protocols python3-jsonschema \
clang-format qtbase5-dev qt6-base-dev libxcb-glx0-dev sudo xz-utils
if [ "amd64" = "$TARGETARCH" ]; then
wget "https://sdk.lunarg.com/sdk/download/1.4.335.0/linux/vulkansdk-linux-x86_64-1.4.335.0.tar.xz" && \
tar -xf vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
rm vulkansdk-linux-x86_64-1.4.335.0.tar.xz && \
mkdir -p /opt/vulkan-sdk && \
mv 1.4.335.0 /opt/vulkan-sdk/ && \
cd /opt/vulkan-sdk/1.4.335.0 && \
./vulkansdk --no-deps --maxjobs \
vulkan-loader \
vulkan-validationlayers \
vulkan-extensionlayer \
vulkan-tools \
shaderc && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/bin/* /usr/bin/ && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/lib/* /usr/lib/x86_64-linux-gnu/ && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/include/* /usr/include/ && \
cp -rfv /opt/vulkan-sdk/1.4.335.0/x86_64/share/* /usr/share/ && \
rm -rf /opt/vulkan-sdk
fi
if [ "arm64" = "$TARGETARCH" ]; then
mkdir vulkan && cd vulkan && \
curl -L -o vulkan-sdk.tar.xz https://github.com/mudler/vulkan-sdk-arm/releases/download/1.4.335.0/vulkansdk-ubuntu-24.04-arm-1.4.335.0.tar.xz && \
tar -xvf vulkan-sdk.tar.xz && \
rm vulkan-sdk.tar.xz && \
cd 1.4.335.0 && \
cp -rfv aarch64/bin/* /usr/bin/ && \
cp -rfv aarch64/lib/* /usr/lib/aarch64-linux-gnu/ && \
cp -rfv aarch64/include/* /usr/include/ && \
cp -rfv aarch64/share/* /usr/share/ && \
cd ../.. && \
rm -rf vulkan
fi
ldconfig && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
# CuBLAS requirements
RUN <<EOT bash
if ( [ "${BUILD_TYPE}" = "cublas" ] || [ "${BUILD_TYPE}" = "l4t" ] ) && [ "${SKIP_DRIVERS}" = "false" ]; then
apt-get update && \
apt-get install -y --no-install-recommends \
software-properties-common pciutils
if [ "amd64" = "$TARGETARCH" ]; then
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/x86_64/cuda-keyring_1.1-1_all.deb
fi
if [ "arm64" = "$TARGETARCH" ]; then
if [ "${CUDA_MAJOR_VERSION}" = "13" ]; then
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/sbsa/cuda-keyring_1.1-1_all.deb
else
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${UBUNTU_VERSION}/arm64/cuda-keyring_1.1-1_all.deb
fi
fi
dpkg -i cuda-keyring_1.1-1_all.deb && \
rm -f cuda-keyring_1.1-1_all.deb && \
apt-get update && \
apt-get install -y --no-install-recommends \
cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcusparse-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
libcusolver-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
if [ "${CUDA_MAJOR_VERSION}" = "13" ] && [ "arm64" = "$TARGETARCH" ]; then
apt-get install -y --no-install-recommends \
libcufile-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libcudnn9-cuda-${CUDA_MAJOR_VERSION} cuda-cupti-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} libnvjitlink-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}
fi
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
# https://github.com/NVIDIA/Isaac-GR00T/issues/343
RUN <<EOT bash
if [ "${BUILD_TYPE}" = "cublas" ] && [ "${TARGETARCH}" = "arm64" ]; then
wget https://developer.download.nvidia.com/compute/cudss/0.6.0/local_installers/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
dpkg -i cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0_0.6.0-1_arm64.deb && \
cp /var/cudss-local-tegra-repo-ubuntu${UBUNTU_VERSION}-0.6.0/cudss-*-keyring.gpg /usr/share/keyrings/ && \
apt-get update && apt-get -y install cudss cudss-cuda-${CUDA_MAJOR_VERSION} && \
wget https://developer.download.nvidia.com/compute/nvpl/25.5/local_installers/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
dpkg -i nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5_1.0-1_arm64.deb && \
cp /var/nvpl-local-repo-ubuntu${UBUNTU_VERSION}-25.5/nvpl-*-keyring.gpg /usr/share/keyrings/ && \
apt-get update && apt-get install -y nvpl
fi
EOT
# If we are building with clblas support, we need the libraries for the builds
RUN if [ "${BUILD_TYPE}" = "clblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
apt-get update && \
apt-get install -y --no-install-recommends \
libclblast-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* \
; fi
RUN if [ "${BUILD_TYPE}" = "hipblas" ] && [ "${SKIP_DRIVERS}" = "false" ]; then \
apt-get update && \
apt-get install -y --no-install-recommends \
hipblas-dev \
rocblas-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
# I have no idea why, but the ROCM lib packages don't trigger ldconfig after they install, which results in local-ai and others not being able
# to locate the libraries. We run ldconfig ourselves to work around this packaging deficiency
ldconfig && \
# Log which GPU architectures have rocBLAS kernel support
echo "rocBLAS library data architectures:" && \
(ls /opt/rocm*/lib/rocblas/library/Kernels* 2>/dev/null || ls /opt/rocm*/lib64/rocblas/library/Kernels* 2>/dev/null) | grep -oP 'gfx[0-9a-z+-]+' | sort -u || \
echo "WARNING: No rocBLAS kernel data found" \
; fi
RUN echo "TARGETARCH: $TARGETARCH"
# We need protoc installed, and the version in 22.04 is too old. We will create one as part installing the GRPC build below
# but that will also being in a newer version of absl which stablediffusion cannot compile with. This version of protoc is only
# here so that we can generate the grpc code for the stablediffusion build
RUN <<EOT bash
if [ "amd64" = "$TARGETARCH" ]; then
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-x86_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
fi
if [ "arm64" = "$TARGETARCH" ]; then
curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v27.1/protoc-27.1-linux-aarch_64.zip -o protoc.zip && \
unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
rm protoc.zip
fi
EOT
# Install CMake (the version in 22.04 is too old)
RUN <<EOT bash
if [ "${CMAKE_FROM_SOURCE}" = "true" ]; then
curl -L -s https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/cmake-${CMAKE_VERSION}.tar.gz -o cmake.tar.gz && tar xvf cmake.tar.gz && cd cmake-${CMAKE_VERSION} && ./configure && make && make install
else
apt-get update && \
apt-get install -y \
cmake && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
fi
EOT
COPY --from=grpc /opt/grpc /usr/local
COPY . /LocalAI
RUN <<'EOT' bash
set -euxo pipefail
if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
rm -rf /LocalAI/backend/cpp/buun-llama-cpp-*-build
fi
cd /LocalAI/backend/cpp/buun-llama-cpp
if [ "${TARGETARCH}" = "arm64" ] || [ "${BUILD_TYPE}" = "hipblas" ]; then
make buun-llama-cpp-fallback
make buun-llama-cpp-grpc
make buun-llama-cpp-rpc-server
else
make buun-llama-cpp-avx
make buun-llama-cpp-avx2
make buun-llama-cpp-avx512
make buun-llama-cpp-fallback
make buun-llama-cpp-grpc
make buun-llama-cpp-rpc-server
fi
EOT
# Copy libraries using a script to handle architecture differences
RUN make -BC /LocalAI/backend/cpp/buun-llama-cpp package
FROM scratch
# Copy all available binaries (the build process only creates the appropriate ones for the target architecture)
COPY --from=builder /LocalAI/backend/cpp/buun-llama-cpp/package/. ./

View File

@@ -0,0 +1,85 @@
# Pinned to the HEAD of master on https://github.com/spiritbuun/buun-llama-cpp.
# Auto-bumped nightly by .github/workflows/bump_deps.yaml.
BUUN_LLAMA_VERSION?=22464d0848b87c5d56b52fdf6af2e5da46bf803e
LLAMA_REPO?=https://github.com/spiritbuun/buun-llama-cpp
CMAKE_ARGS?=
BUILD_TYPE?=
NATIVE?=false
ONEAPI_VARS?=/opt/intel/oneapi/setvars.sh
TARGET?=--target grpc-server
JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1)
ARCH?=$(shell uname -m)
CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
LLAMA_CPP_DIR := $(CURRENT_MAKEFILE_DIR)/../llama-cpp
GREEN := \033[0;32m
RESET := \033[0m
# buun-llama-cpp is a llama.cpp fork-of-a-fork (spiritbuun/buun-llama-cpp forked
# TheTom/llama-cpp-turboquant, which itself forked ggml-org/llama.cpp). Rather
# than duplicating grpc-server.cpp / CMakeLists.txt / prepare.sh we reuse the
# ones in backend/cpp/llama-cpp, and only swap which repo+sha the fetch step
# pulls. Each flavor target copies ../llama-cpp into a sibling
# ../buun-llama-cpp-<flavor>-build directory, then invokes llama-cpp's own
# build-llama-cpp-grpc-server with LLAMA_REPO/LLAMA_VERSION overridden to point
# at the fork.
PATCHES_DIR := $(CURRENT_MAKEFILE_DIR)/patches
# Each flavor target:
# 1. copies backend/cpp/llama-cpp/ (grpc-server.cpp + prepare.sh + CMakeLists.txt + Makefile)
# into a sibling buun-llama-cpp-<flavor>-build directory;
# 2. clones the buun fork into buun-llama-cpp-<flavor>-build/llama.cpp via the
# copy's own `llama.cpp` target, overriding LLAMA_REPO/LLAMA_VERSION;
# 3. applies patches from backend/cpp/buun-llama-cpp/patches/ to the cloned
# fork sources (for backporting upstream commits the fork hasn't pulled);
# 4. runs the copy's `grpc-server` target, which produces the binary we copy
# up as buun-llama-cpp-<flavor>.
define buun-llama-cpp-build
rm -rf $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build
cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build purge
# Augment the copied grpc-server.cpp's KV-cache allow-list with the
# fork's turbo2/turbo3/turbo4/turbo2_tcq/turbo3_tcq types and wire up the
# DFlash-specific option handlers (tree_budget / draft_topk). We patch the
# *copy*, never the original under backend/cpp/llama-cpp/, so the stock
# llama-cpp build stays compiling against vanilla upstream.
bash $(CURRENT_MAKEFILE_DIR)/patch-grpc-server.sh $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build/grpc-server.cpp
$(info $(GREEN)I buun-llama-cpp build info:$(1)$(RESET))
LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(BUUN_LLAMA_VERSION) \
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build llama.cpp
bash $(CURRENT_MAKEFILE_DIR)/apply-patches.sh $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build/llama.cpp $(PATCHES_DIR)
CMAKE_ARGS="$(CMAKE_ARGS) $(2)" TARGET="$(3)" \
LLAMA_REPO=$(LLAMA_REPO) LLAMA_VERSION=$(BUUN_LLAMA_VERSION) \
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build grpc-server
cp -rfv $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-$(1)-build/grpc-server buun-llama-cpp-$(1)
endef
buun-llama-cpp-avx2:
$(call buun-llama-cpp-build,avx2,-DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on,--target grpc-server)
buun-llama-cpp-avx512:
$(call buun-llama-cpp-build,avx512,-DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on,--target grpc-server)
buun-llama-cpp-avx:
$(call buun-llama-cpp-build,avx,-DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)
buun-llama-cpp-fallback:
$(call buun-llama-cpp-build,fallback,-DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)
buun-llama-cpp-grpc:
$(call buun-llama-cpp-build,grpc,-DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server --target rpc-server)
buun-llama-cpp-rpc-server: buun-llama-cpp-grpc
cp -rf $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-grpc-build/llama.cpp/build/bin/rpc-server buun-llama-cpp-rpc-server
package:
bash package.sh
purge:
rm -rf $(CURRENT_MAKEFILE_DIR)/../buun-llama-cpp-*-build
rm -rf buun-llama-cpp-* package
clean: purge

View File

@@ -0,0 +1,50 @@
#!/bin/bash
# Apply the buun-llama-cpp patch series to a cloned buun-llama-cpp checkout.
#
# buun-llama-cpp is a fork-of-a-fork that branched off upstream llama.cpp
# before some API changes the shared backend/cpp/llama-cpp/grpc-server.cpp
# depends on. We carry those upstream commits as patch files under
# backend/cpp/buun-llama-cpp/patches/ and apply them here so the reused
# grpc-server source compiles against the fork unmodified.
#
# Drop the corresponding patch from patches/ whenever the fork catches up with
# upstream — the build will fail fast if a patch stops applying, which is the
# signal to retire it.
set -euo pipefail
if [[ $# -ne 2 ]]; then
echo "usage: $0 <llama.cpp-src-dir> <patches-dir>" >&2
exit 2
fi
SRC_DIR=$1
PATCHES_DIR=$2
if [[ ! -d "$SRC_DIR" ]]; then
echo "source dir does not exist: $SRC_DIR" >&2
exit 2
fi
if [[ ! -d "$PATCHES_DIR" ]]; then
echo "no patches dir at $PATCHES_DIR, nothing to apply"
exit 0
fi
shopt -s nullglob
patches=("$PATCHES_DIR"/*.patch)
shopt -u nullglob
if [[ ${#patches[@]} -eq 0 ]]; then
echo "no .patch files in $PATCHES_DIR, nothing to apply"
exit 0
fi
cd "$SRC_DIR"
for patch in "${patches[@]}"; do
echo "==> applying $patch"
git apply --verbose "$patch"
done
echo "all buun-llama-cpp patches applied successfully"

View File

@@ -0,0 +1,57 @@
#!/bin/bash
# Script to copy the appropriate libraries based on architecture
# This script is used in the final stage of the Dockerfile
set -e
CURDIR=$(dirname "$(realpath $0)")
REPO_ROOT="${CURDIR}/../../.."
# Create lib directory
mkdir -p $CURDIR/package/lib
cp -avrf $CURDIR/buun-llama-cpp-* $CURDIR/package/
cp -rfv $CURDIR/run.sh $CURDIR/package/
# Detect architecture and copy appropriate libraries
if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
# x86_64 architecture
echo "Detected x86_64 architecture, copying x86_64 libraries..."
cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
# ARM64 architecture
echo "Detected ARM64 architecture, copying ARM64 libraries..."
cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
else
echo "Error: Could not detect architecture"
exit 1
fi
# Package GPU libraries based on BUILD_TYPE
GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
if [ -f "$GPU_LIB_SCRIPT" ]; then
echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
package_gpu_libs
fi
echo "Packaging completed successfully"
ls -liah $CURDIR/package/
ls -liah $CURDIR/package/lib/

View File

@@ -0,0 +1,162 @@
#!/bin/bash
# Patch the shared backend/cpp/llama-cpp/grpc-server.cpp *copy* used by the
# buun-llama-cpp build to account for three gaps between upstream and the fork:
#
# 1. Augment the kv_cache_types[] allow-list so `LoadModel` accepts the
# fork-specific `turbo2` / `turbo3` / `turbo4` cache types plus the buun
# additions `turbo2_tcq` / `turbo3_tcq`.
#
# 2. Wire up buun-exclusive speculative-decoding option handlers
# (tree_budget / draft_topk) alongside the existing spec_* handlers.
# These reference struct fields (common_params.speculative.tree_budget
# and .draft_topk) that only exist in buun's common/common.h — adding
# them to the shared backend/cpp/llama-cpp/grpc-server.cpp would break
# the stock llama-cpp build, so we inject them only into the buun copy.
#
# 3. Replace `get_media_marker()` (added upstream in ggml-org/llama.cpp#21962,
# server-side random per-instance marker) with the legacy "<__media__>"
# literal. The fork branched before that PR, so server-common.cpp has no
# get_media_marker symbol. The fork's mtmd_default_marker() still returns
# "<__media__>", and Go-side tooling falls back to that sentinel when the
# backend does not expose media_marker, so substituting the literal keeps
# behavior identical on the buun path.
#
# We patch the *copy* sitting in buun-llama-cpp-<flavor>-build/, never the
# original under backend/cpp/llama-cpp/, so the stock llama-cpp build keeps
# compiling against vanilla upstream.
#
# Idempotent: skips each insertion if its marker is already present (so re-runs
# of the same build dir don't double-insert).
set -euo pipefail
if [[ $# -ne 1 ]]; then
echo "usage: $0 <grpc-server.cpp>" >&2
exit 2
fi
SRC=$1
if [[ ! -f "$SRC" ]]; then
echo "grpc-server.cpp not found at $SRC" >&2
exit 2
fi
if grep -q 'GGML_TYPE_TURBO2_TCQ' "$SRC"; then
echo "==> $SRC already has buun cache types, skipping KV allow-list patch"
else
echo "==> patching $SRC to allow turbo2/turbo3/turbo4/turbo2_tcq/turbo3_tcq KV-cache types"
# Insert the five TURBO entries right after the first ` GGML_TYPE_Q5_1,`
# line (the kv_cache_types[] allow-list). Using awk because the builder
# image does not ship python3, and GNU sed's multi-line `a\` quoting is
# awkward.
awk '
/^ GGML_TYPE_Q5_1,$/ && !done {
print
print " // buun-llama-cpp fork extras — added by patch-grpc-server.sh"
print " GGML_TYPE_TURBO2_0,"
print " GGML_TYPE_TURBO3_0,"
print " GGML_TYPE_TURBO4_0,"
print " GGML_TYPE_TURBO2_TCQ,"
print " GGML_TYPE_TURBO3_TCQ,"
done = 1
next
}
{ print }
END {
if (!done) {
print "patch-grpc-server.sh: anchor ` GGML_TYPE_Q5_1,` not found" > "/dev/stderr"
exit 1
}
}
' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> KV allow-list patch OK"
fi
if grep -q 'optname, "tree_budget"' "$SRC"; then
echo "==> $SRC already has DFlash option handlers, skipping"
else
echo "==> patching $SRC to add tree_budget / draft_topk option handlers"
# Insert two new `else if` handlers between the inner close-brace of the
# `spec_p_split` block and the next `} else if (…spec_ngram_size_n…)` line.
# Upstream writes each `} else if` as a single physical line, so we don't
# emit an outer `}` ourselves — the existing next line provides both the
# close of our `draft_topk` block and the open of `spec_ngram_size_n`.
# Anchor on the exact 3-line body of spec_p_split so we can't drift.
awk '
prev2 == " } else if (!strcmp(optname, \"spec_p_split\")) {" &&
prev1 ~ /^ +if \(optval != NULL\) \{$/ &&
$0 ~ /^ +try \{ params\.speculative\.p_split = std::stof\(optval_str\); \} catch \(\.\.\.\) \{\}$/ &&
!done {
print # print the try-line itself
getline inner_close # read " }" closing the inner if
print inner_close # print it — this closes spec_p_split body
print " // buun-llama-cpp DFlash options — added by patch-grpc-server.sh"
print " } else if (!strcmp(optname, \"tree_budget\")) {"
print " if (optval != NULL) {"
print " try { params.speculative.tree_budget = std::stoi(optval_str); } catch (...) {}"
print " }"
print " } else if (!strcmp(optname, \"draft_topk\")) {"
print " if (optval != NULL) {"
print " try { params.speculative.draft_topk = std::stoi(optval_str); } catch (...) {}"
print " }"
# The next source line (`} else if (…spec_ngram_size_n…) {`) closes
# our draft_topk block and continues the chain naturally; fall back
# into the main loop to emit it and everything after.
done = 1
prev2 = prev1
prev1 = inner_close
next
}
{ print; prev2 = prev1; prev1 = $0 }
END {
if (!done) {
print "patch-grpc-server.sh: spec_p_split anchor not found" > "/dev/stderr"
exit 1
}
}
' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> DFlash option-handler patch OK"
fi
if grep -qE 'ctx_server\.get_meta\(\)\.logit_bias_eog|params_base\.sampling\.logit_bias_eog,' "$SRC"; then
echo "==> patching $SRC to drop the logit_bias_eog arg from params_from_json_cmpl() callsites (buun still uses the pre-refactor 4-arg signature)"
# Upstream llama.cpp refactored params_from_json_cmpl to take a precomputed
# logit_bias_eog vector after buun's 2026-04-05 fork-point — simultaneously
# adding server_context_meta::logit_bias_eog as the supplier. Buun carries
# neither change: its params_from_json_cmpl is still 4-arg, and internally
# derives logit_bias_eog from the common_params it's passed. So we just
# delete the argument line entirely — the remaining 4 args match buun's
# signature and the resulting behavior matches upstream bit-for-bit
# (upstream's 5th arg is the same data buun derives internally).
#
# Guard is broad so this works whether the line has been run through this
# block before (leaving params_base.sampling.logit_bias_eog,) or not
# (leaving the original ctx_server.get_meta().logit_bias_eog,).
sed -E '/^[[:space:]]+(ctx_server\.get_meta\(\)\.logit_bias_eog|params_base\.sampling\.logit_bias_eog),$/d' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> logit_bias_eog arg drop OK"
else
echo "==> $SRC has no logit_bias_eog arg line, skipping"
fi
if grep -q 'get_media_marker()' "$SRC"; then
echo "==> patching $SRC to replace get_media_marker() with legacy \"<__media__>\" literal"
# Only one call site today (ModelMetadata), but replace all occurrences to
# stay robust if upstream adds more. Use a temp file to avoid relying on
# sed -i portability (the builder image uses GNU sed, but keeping this
# consistent with the awk block above).
sed 's/get_media_marker()/"<__media__>"/g' "$SRC" > "$SRC.tmp"
mv "$SRC.tmp" "$SRC"
echo "==> get_media_marker() substitution OK"
else
echo "==> $SRC has no get_media_marker() call, skipping media-marker patch"
fi
echo "==> all patches applied"

View File

@@ -0,0 +1,46 @@
Subject: [PATCH] ggml-cuda/fattn: provide atomicAdd(double*,double) shim for pre-sm_60
Buun's Q² calibration path in ggml_cuda_turbo_scale_q calls
atomicAdd(&d_q_channel_sq_fattn[threadIdx.x], (double)(val * val));
but native double atomicAdd is only available on compute capability 6.0
and newer. Compiling against a CUDA arch list that includes older
architectures (LocalAI's CUDA 12 Docker image builds for the full
published arch range) fails with:
fattn.cu(812): error: no instance of overloaded function "atomicAdd"
matches the argument list, argument types are: (double *, double)
Add the canonical CUDA-programming-guide shim at the top of fattn.cu so
pre-sm_60 codegen has a definition to call. On sm_60+ the native CUDA
intrinsic is used and the shim is elided via __CUDA_ARCH__.
--- a/ggml/src/ggml-cuda/fattn.cu
+++ b/ggml/src/ggml-cuda/fattn.cu
@@ -7,6 +7,27 @@
#include <atomic>
+// Pre-sm_60 double atomicAdd shim. Native double atomicAdd(double*,double)
+// is only available on CUDA compute capability 6.0+ (see CUDA C Programming
+// Guide, B.15 Atomic Functions). Buun's Q² calibration path below calls
+// atomicAdd with a double*; without this definition, nvcc fails to find a
+// matching overload whenever the compile target list includes pre-sm_60
+// architectures. The standard CAS loop implementation below matches the
+// semantics of the native intrinsic.
+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 600
+static __device__ double atomicAdd(double * address, double val) {
+ unsigned long long int * address_as_ull = (unsigned long long int *)address;
+ unsigned long long int old = *address_as_ull;
+ unsigned long long int assumed;
+ do {
+ assumed = old;
+ old = atomicCAS(address_as_ull, assumed,
+ __double_as_longlong(val + __longlong_as_double(assumed)));
+ } while (assumed != old);
+ return __longlong_as_double(old);
+}
+#endif
+
// InnerQ: update the fattn-side inverse scale array from host (all devices)
void turbo_innerq_update_fattn_scales(const float * scale_inv) {
int cur_device;

View File

@@ -0,0 +1,32 @@
Subject: [PATCH] ggml-cuda/argmax: pass WARP_SIZE to the top-K __shfl_xor_sync calls
Two __shfl_xor_sync calls in the top-K intra-warp merge drop the `width`
argument and rely on the CUDA default (warpSize). Every other call in
the same file already passes WARP_SIZE explicitly, and the HIP/ROCm
compatibility shim at ggml/src/ggml-cuda/vendors/hip.h:33 is a 4-arg
function-like macro — so the 3-arg form fails to preprocess when
building with hipcc against ROCm:
argmax.cu:265: error: too few arguments provided to function-like
macro invocation
note: macro '__shfl_xor_sync' defined here:
#define __shfl_xor_sync(mask, var, laneMask, width) \
__shfl_xor(var, laneMask, width)
Align the two call sites with the rest of the file by passing WARP_SIZE
explicitly. On CUDA the generated code is unchanged (warpSize is the
default); on HIP it now matches the macro's arity.
--- a/ggml/src/ggml-cuda/argmax.cu
+++ b/ggml/src/ggml-cuda/argmax.cu
@@ -262,8 +262,8 @@
// Each step: lane gets partner's min element, if it beats our min, replace and re-heapify
for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1) {
for (int i = 0; i < K; i++) {
- float partner_val = __shfl_xor_sync(0xFFFFFFFF, heap_val[i], offset);
- int partner_idx = __shfl_xor_sync(0xFFFFFFFF, heap_idx[i], offset);
+ float partner_val = __shfl_xor_sync(0xFFFFFFFF, heap_val[i], offset, WARP_SIZE);
+ int partner_idx = __shfl_xor_sync(0xFFFFFFFF, heap_idx[i], offset, WARP_SIZE);
if (partner_val > heap_val[0]) {
heap_val[0] = partner_val;
heap_idx[0] = partner_idx;

View File

@@ -0,0 +1,24 @@
Subject: [PATCH] ggml-cuda/vendors/hip: alias cudaMemcpy{To,From}Symbol to hip counterparts
Buun's Q² calibration + TCQ codebook upload paths in fattn.cu use
cudaMemcpyToSymbol / cudaMemcpyFromSymbol. The HIP-compat header in
ggml/src/ggml-cuda/vendors/hip.h already aliases the scalar cudaMemcpy
family (cudaMemcpy, cudaMemcpyAsync, cudaMemcpy2DAsync, …) but is
missing the symbol variants. Building with hipcc therefore fails with
15+ "use of undeclared identifier 'cudaMemcpyToSymbol'" errors.
Add the two missing aliases alongside the existing memcpy block. HIP
provides hipMemcpy{To,From}Symbol with the same signature as CUDA's
equivalents, so this is a straight name substitution.
--- a/ggml/src/ggml-cuda/vendors/hip.h
+++ b/ggml/src/ggml-cuda/vendors/hip.h
@@ -85,6 +85,8 @@
#define cudaMemcpyDeviceToDevice hipMemcpyDeviceToDevice
#define cudaMemcpyDeviceToHost hipMemcpyDeviceToHost
#define cudaMemcpyHostToDevice hipMemcpyHostToDevice
+#define cudaMemcpyToSymbol hipMemcpyToSymbol
+#define cudaMemcpyFromSymbol hipMemcpyFromSymbol
#define cudaMemcpyKind hipMemcpyKind
#define cudaMemset hipMemset
#define cudaMemsetAsync hipMemsetAsync

View File

@@ -0,0 +1,36 @@
Subject: [PATCH] ggml-cuda/fattn: pass WARP_SIZE to fwht128 __shfl_xor_sync calls
Same issue as the argmax top-K fix: two __shfl_xor_sync call sites in
the FWHT-128 butterfly kernels (ggml_cuda_fwht128 and fwht128_store_half)
use the 3-arg CUDA form and omit the `width` argument that the HIP
function-like macro in vendors/hip.h:33 requires. Hipcc fails with:
fattn.cu:512: too few arguments provided to function-like macro
invocation
note: macro '__shfl_xor_sync' defined here:
#define __shfl_xor_sync(mask, var, laneMask, width) \
__shfl_xor(var, laneMask, width)
Add WARP_SIZE to both calls. CUDA codegen is unchanged (warpSize is the
default); HIP now matches the macro arity.
--- a/ggml/src/ggml-cuda/fattn.cu
+++ b/ggml/src/ggml-cuda/fattn.cu
@@ -509,7 +509,7 @@
// Intra-warp passes: shuffle xor with stride h, no smem, no sync.
#pragma unroll
for (int h = 1; h <= 16; h *= 2) {
- const float other = __shfl_xor_sync(0xFFFFFFFF, val, h);
+ const float other = __shfl_xor_sync(0xFFFFFFFF, val, h, WARP_SIZE);
val = (tid & h) ? (other - val) : (val + other);
}
@@ -533,7 +533,7 @@
static __device__ __forceinline__ void fwht128_store_half(
float val, half * dst_base) {
const int tid = threadIdx.x;
- const float neighbor = __shfl_xor_sync(0xFFFFFFFF, val, 1);
+ const float neighbor = __shfl_xor_sync(0xFFFFFFFF, val, 1, WARP_SIZE);
if ((tid & 1) == 0) {
const half2 packed = __floats2half2_rn(val, neighbor);
*((half2 *)(dst_base + tid)) = packed;

View File

@@ -0,0 +1,65 @@
#!/bin/bash
set -ex
# Get the absolute current dir where the script is located
CURDIR=$(dirname "$(realpath $0)")
cd /
echo "CPU info:"
grep -e "model\sname" /proc/cpuinfo | head -1
grep -e "flags" /proc/cpuinfo | head -1
BINARY=buun-llama-cpp-fallback
if grep -q -e "\savx\s" /proc/cpuinfo ; then
echo "CPU: AVX found OK"
if [ -e $CURDIR/buun-llama-cpp-avx ]; then
BINARY=buun-llama-cpp-avx
fi
fi
if grep -q -e "\savx2\s" /proc/cpuinfo ; then
echo "CPU: AVX2 found OK"
if [ -e $CURDIR/buun-llama-cpp-avx2 ]; then
BINARY=buun-llama-cpp-avx2
fi
fi
# Check avx 512
if grep -q -e "\savx512f\s" /proc/cpuinfo ; then
echo "CPU: AVX512F found OK"
if [ -e $CURDIR/buun-llama-cpp-avx512 ]; then
BINARY=buun-llama-cpp-avx512
fi
fi
if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then
if [ -e $CURDIR/buun-llama-cpp-grpc ]; then
BINARY=buun-llama-cpp-grpc
fi
fi
# Extend ld library path with the dir where this script is located/lib
if [ "$(uname)" == "Darwin" ]; then
export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
else
export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
# Tell rocBLAS where to find TensileLibrary data (GPU kernel tuning files)
if [ -d "$CURDIR/lib/rocblas/library" ]; then
export ROCBLAS_TENSILE_LIBPATH=$CURDIR/lib/rocblas/library
fi
fi
# If there is a lib/ld.so, use it
if [ -f $CURDIR/lib/ld.so ]; then
echo "Using lib/ld.so"
echo "Using binary: $BINARY"
exec $CURDIR/lib/ld.so $CURDIR/$BINARY "$@"
fi
echo "Using binary: $BINARY"
exec $CURDIR/$BINARY "$@"
# We should never reach this point, however just in case we do, run fallback
exec $CURDIR/buun-llama-cpp-fallback "$@"

View File

@@ -37,6 +37,14 @@ var CacheTypeOptions = []FieldOption{
{Value: "q4_1", Label: "Q4_1"},
{Value: "q5_0", Label: "Q5_0"},
{Value: "q5_1", Label: "Q5_1"},
// TurboQuant KV-cache types — accepted by the turboquant and
// buun-llama-cpp fork backends; stock llama-cpp will reject them at load.
{Value: "turbo2", Label: "Turbo2 (TurboQuant)"},
{Value: "turbo3", Label: "Turbo3 (TurboQuant)"},
{Value: "turbo4", Label: "Turbo4 (TurboQuant)"},
// Trellis-Coded Quantization variants — buun-llama-cpp only.
{Value: "turbo2_tcq", Label: "Turbo2 TCQ (buun-llama-cpp)"},
{Value: "turbo3_tcq", Label: "Turbo3 TCQ (buun-llama-cpp)"},
}
var DiffusersPipelineOptions = []FieldOption{

View File

@@ -34,6 +34,7 @@ func (i *LlamaCPPImporter) AdditionalBackends() []KnownBackendEntry {
return []KnownBackendEntry{
{Name: "ik-llama-cpp", Modality: "text", Description: "GGUF drop-in replacement for llama-cpp with ik-quants"},
{Name: "turboquant", Modality: "text", Description: "GGUF drop-in replacement for llama-cpp with TurboQuant optimizations"},
{Name: "buun-llama-cpp", Modality: "text", Description: "GGUF drop-in replacement for llama-cpp with DFlash speculative decoding and TurboQuant/TCQ KV-cache quantization"},
}
}
@@ -127,7 +128,7 @@ func (i *LlamaCPPImporter) Import(details Details) (gallery.ModelConfig, error)
backend := "llama-cpp"
if b, ok := preferencesMap["backend"].(string); ok {
switch b {
case "ik-llama-cpp", "turboquant":
case "ik-llama-cpp", "turboquant", "buun-llama-cpp":
backend = b
}
}

View File

@@ -181,6 +181,23 @@ var _ = Describe("LlamaCPPImporter", func() {
Expect(modelConfig.Files[0].Filename).To(Equal("my-model.gguf"))
})
It("swaps the emitted backend to buun-llama-cpp when preferred", func() {
preferences := json.RawMessage(`{"backend": "buun-llama-cpp"}`)
details := Details{
URI: "https://example.com/my-model.gguf",
Preferences: preferences,
}
modelConfig, err := importer.Import(details)
Expect(err).ToNot(HaveOccurred())
Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: buun-llama-cpp"), fmt.Sprintf("Model config: %+v", modelConfig))
Expect(modelConfig.ConfigFile).NotTo(ContainSubstring("backend: llama-cpp\n"), fmt.Sprintf("Model config: %+v", modelConfig))
Expect(modelConfig.ConfigFile).To(ContainSubstring("model: my-model.gguf"), fmt.Sprintf("Model config: %+v", modelConfig))
Expect(len(modelConfig.Files)).To(Equal(1))
Expect(modelConfig.Files[0].Filename).To(Equal("my-model.gguf"))
})
It("keeps backend: llama-cpp for unknown backend preferences", func() {
// Unknown backend values must not leak into the emitted YAML —
// we only honour the two curated drop-in replacements.
@@ -375,7 +392,7 @@ var _ = Describe("LlamaCPPImporter", func() {
})
Context("AdditionalBackends", func() {
It("advertises ik-llama-cpp and turboquant as drop-in replacements", func() {
It("advertises ik-llama-cpp, turboquant, and buun-llama-cpp as drop-in replacements", func() {
entries := importer.AdditionalBackends()
names := make([]string, 0, len(entries))
@@ -384,7 +401,7 @@ var _ = Describe("LlamaCPPImporter", func() {
names = append(names, e.Name)
byName[e.Name] = e
}
Expect(names).To(ConsistOf("ik-llama-cpp", "turboquant"))
Expect(names).To(ConsistOf("ik-llama-cpp", "turboquant", "buun-llama-cpp"))
ik := byName["ik-llama-cpp"]
Expect(ik.Modality).To(Equal("text"))
@@ -393,6 +410,10 @@ var _ = Describe("LlamaCPPImporter", func() {
tq := byName["turboquant"]
Expect(tq.Modality).To(Equal("text"))
Expect(tq.Description).NotTo(BeEmpty())
bn := byName["buun-llama-cpp"]
Expect(bn.Modality).To(Equal("text"))
Expect(bn.Description).NotTo(BeEmpty())
})
})
})

View File

@@ -631,6 +631,83 @@ The `cache_type_k` / `cache_type_v` fields map to llama.cpp's `-ctk` / `-ctv` fl
- [Tracked branch: `feature/turboquant-kv-cache`](https://github.com/TheTom/llama-cpp-turboquant/tree/feature/turboquant-kv-cache)
### buun-llama-cpp (DFlash speculative decoding + TurboQuant/TCQ KV-cache)
[buun-llama-cpp](https://github.com/spiritbuun/buun-llama-cpp) is a fork-of-a-fork: spiritbuun forked `TheTom/llama-cpp-turboquant` (the `turboquant` backend above) and added two independent features on top:
1. **DFlash** — a block-diffusion speculative decoding scheme that uses a dedicated drafter model (new `DFlashDraftModel` GGUF architecture). On a target/drafter pair it emits a block of tokens per speculation step and can be combined with tree-structured verification ("DDTree") for multi-branch draft expansion.
2. **TCQ (Trellis-Coded Quantization)** — two additional KV-cache types (`turbo2_tcq`, `turbo3_tcq`) on top of the TurboQuant `turbo2` / `turbo3` / `turbo4` already shipped by the parent fork, delivering 1044% KL reduction over scalar quantization at 23 bits per value.
Like `turboquant`, this backend shares LocalAI's stock `llama-cpp` gRPC server sources — so any GGUF model that runs on `llama-cpp` also runs on `buun-llama-cpp`. Pick it over `turboquant` specifically when you want DFlash speculative decoding or the newer TCQ KV-cache variants.
#### Features
- Drop-in GGUF compatibility with upstream `llama.cpp`.
- DFlash block-diffusion speculative decoding (CUDA/Metal; no CPU fallback).
- TurboQuant KV-cache types (`turbo2`, `turbo3`, `turbo4`) inherited from the parent `turboquant` fork, plus buun-exclusive `turbo2_tcq` and `turbo3_tcq` variants.
- Same feature surface as `llama-cpp`: text generation, embeddings, tool calls, multimodal via mmproj.
- Available on CPU (AVX/AVX2/AVX512/fallback), NVIDIA CUDA 12/13, AMD ROCm/HIP, Intel SYCL f32/f16, Vulkan, and NVIDIA L4T — but note that DFlash and `turbo*` KV types have no CPU fallback and error at model-load on CPU-only builds.
#### Setup
`buun-llama-cpp` ships as a separate container image in the LocalAI backend gallery. Install it like any other backend:
```bash
local-ai backends install buun-llama-cpp
```
Or pick a specific flavor for your hardware (example tags: `cpu-buun-llama-cpp`, `cuda12-buun-llama-cpp`, `cuda13-buun-llama-cpp`, `rocm-buun-llama-cpp`, `intel-sycl-f16-buun-llama-cpp`, `vulkan-buun-llama-cpp`).
#### YAML configuration — TCQ KV-cache
To run a model with TurboQuant/TCQ quantized KV-cache, set the backend and pick a `turbo*` cache type:
```yaml
name: my-model
backend: buun-llama-cpp
parameters:
model: file.gguf
# Accepted values for the two fork-aware backends include the stock llama.cpp
# types (f16, f32, q8_0, q4_0, q4_1, q5_0, q5_1), the TurboQuant types
# (turbo2, turbo3, turbo4), and the buun-only TCQ variants (turbo2_tcq,
# turbo3_tcq). turbo3 / turbo4 / turbo*_tcq auto-enable flash_attention.
cache_type_k: turbo3
cache_type_v: turbo3_tcq
context_size: 8192
```
#### YAML configuration — DFlash speculative decoding
DFlash requires a **dedicated drafter model** in the new `DFlashDraftModel` GGUF architecture. At time of writing the only known public target/drafter pair is [`z-lab/Qwen3.5-27B`](https://huggingface.co/z-lab/Qwen3.5-27B) + [`z-lab/Qwen3.5-27B-DFlash`](https://huggingface.co/z-lab/Qwen3.5-27B-DFlash).
```yaml
name: qwen3-dflash
backend: buun-llama-cpp
parameters:
# Target model (quantized as usual)
model: Qwen3.5-27B-Q4_K_M.gguf
# Drafter model produced by buun's convert_hf_to_gguf.py from the
# DFlashDraftModel checkpoint. Resolved relative to the models path.
draft_model: Qwen3.5-27B-DFlash.gguf
options:
# Switches the speculative pipeline from the default draft-model mode to
# DFlash (block-diffusion). Required to activate the DFlash code path.
- spec_type:dflash
# Optional tuning:
# - tree_budget:0 # 0 = flat DFlash; >0 = DDTree verification budget
# - draft_topk:1 # drafter top-K per position (1 = argmax)
# - spec_n_max:16 # cap on draft tokens per speculation step
```
Under the hood LocalAI wires `draft_model` through to the grpc-server's `params.speculative.mparams_dft.path`, and `spec_type:dflash` is forwarded through the options passthrough to buun's `common_speculative_type_from_name("dflash")`. The `tree_budget` and `draft_topk` options are buun-exclusive; they reference struct fields that only exist in buun's fork, so they're surfaced on this backend only (passing them to stock `llama-cpp` is a no-op).
#### Reference
- [spiritbuun/buun-llama-cpp](https://github.com/spiritbuun/buun-llama-cpp)
- [TCQ paper / dataset](https://huggingface.co/datasets/spiritbuun/turboquant-tcq-kv-cache) — *"Closing the Gap: Trellis-Coded Quantization for KV Cache at 2-3 Bits"*
- DFlash target/drafter pair: [`z-lab/Qwen3.5-27B`](https://huggingface.co/z-lab/Qwen3.5-27B) + [`z-lab/Qwen3.5-27B-DFlash`](https://huggingface.co/z-lab/Qwen3.5-27B-DFlash)
### vLLM
[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference.

View File

@@ -20,6 +20,7 @@ LocalAI will attempt to automatically load models which are not explicitly confi
|---------|-------------|------------|------------|-----------|-------------|
| [llama.cpp](https://github.com/ggerganov/llama.cpp) | LLM inference in C/C++. Supports LLaMA, Mamba, RWKV, Falcon, Starcoder, GPT-2, [and many others](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#description) | GPT, Functions | yes | yes | CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T |
| [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp) | Hard fork of llama.cpp optimized for CPU/hybrid CPU+GPU with IQK quants, custom quant mixes, and MLA for DeepSeek | GPT | yes | yes | CPU (AVX2+) |
| [buun-llama-cpp](https://github.com/spiritbuun/buun-llama-cpp) | llama.cpp fork with DFlash block-diffusion speculative decoding and TurboQuant/TCQ KV-cache quantization (23 bits per value). Accelerated paths are CUDA/Metal only. | GPT, Functions | yes | yes | CUDA, Metal (CPU fallback for non-turbo/non-DFlash only) |
| [vLLM](https://github.com/vllm-project/vllm) | Fast LLM serving with PagedAttention | GPT, Functions | no | yes | CPU, CUDA 12, ROCm, Intel |
| [vLLM Omni](https://github.com/vllm-project/vllm) | Unified multimodal generation (text, image, video, audio) | Multimodal GPT, Functions | no | yes | CUDA 12, ROCm |
| [transformers](https://github.com/huggingface/transformers) | HuggingFace Transformers framework | GPT, Embeddings, Multimodal | yes | yes* | CPU, CUDA 12/13, ROCm, Intel, Metal |

View File

@@ -32,6 +32,12 @@ function inferBackendPath(item) {
// via a thin wrapper Makefile. Changes to either dir should retrigger it.
return `backend/cpp/turboquant/`;
}
if (item.dockerfile.endsWith("buun-llama-cpp")) {
// buun-llama-cpp is a fork-of-a-fork (spiritbuun/buun-llama-cpp forks
// TheTom/llama-cpp-turboquant) that reuses backend/cpp/llama-cpp sources
// the same way turboquant does. Changes to either dir retrigger it.
return `backend/cpp/buun-llama-cpp/`;
}
if (item.dockerfile.endsWith("llama-cpp")) {
return `backend/cpp/llama-cpp/`;
}
@@ -138,9 +144,10 @@ async function getChangedFiles() {
// Per-backend boolean outputs
for (const [backend, pathPrefix] of allBackendPaths) {
let changed = changedFiles.some(file => file.startsWith(pathPrefix));
// turboquant reuses backend/cpp/llama-cpp sources via a thin wrapper;
// changes to either directory should retrigger its pipeline.
if (backend === "turboquant" && !changed) {
// turboquant and buun-llama-cpp reuse backend/cpp/llama-cpp sources via
// thin wrapper Makefiles; changes to that directory should retrigger
// their pipelines too.
if ((backend === "turboquant" || backend === "buun-llama-cpp") && !changed) {
changed = changedFiles.some(file => file.startsWith("backend/cpp/llama-cpp/"));
}
fs.appendFileSync(process.env.GITHUB_OUTPUT, `${backend}=${changed ? 'true' : 'false'}\n`);