Two more hipblas-only build failures in buun's fattn.cu, fixed under the
same patches/ infrastructure:
1. cudaMemcpyToSymbol / cudaMemcpyFromSymbol — buun's Q² calibration +
TCQ codebook upload paths call the symbol variants of cudaMemcpy.
ggml/src/ggml-cuda/vendors/hip.h aliases every other cudaMemcpy*
name (cudaMemcpy, cudaMemcpyAsync, cudaMemcpy2DAsync, …) but the
symbol pair was never added. 15+ "use of undeclared identifier"
errors across fattn.cu lines 40, 54, 74-76, 94, 100-101, 371, 883,
905, 954, 976, 1449, 1463. Add the two missing aliases alongside
the existing memcpy block.
2. __shfl_xor_sync fwht128 calls — same 3-arg omission pattern as the
earlier argmax top-K fix. Lines 512 (ggml_cuda_fwht128 intra-warp
butterfly) and 536 (fwht128_store_half neighbor fetch) drop the
width argument that hip.h:33 requires. Add WARP_SIZE.
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Two call sites in ggml/src/ggml-cuda/argmax.cu (the top-K intra-warp
merge added by buun) use the 3-arg CUDA form __shfl_xor_sync(mask, var,
laneMask), omitting the optional width parameter. The hipification shim
at ggml/src/ggml-cuda/vendors/hip.h:33 is a function-like macro that
requires all four arguments, so hipcc fails with:
argmax.cu:265: too few arguments provided to function-like macro
invocation
note: macro '__shfl_xor_sync' defined here:
#define __shfl_xor_sync(mask, var, laneMask, width) \
__shfl_xor(var, laneMask, width)
Every other call in the same file already passes WARP_SIZE explicitly;
aligning these two with that convention fixes the hipblas build without
changing CUDA codegen (warpSize is the CUDA default).
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Buun's Q² calibration path in ggml/src/ggml-cuda/fattn.cu calls
atomicAdd with a double* destination. Native double atomicAdd is only
available on CUDA compute capability 6.0 and later — LocalAI's CUDA 12
Docker image builds for the full published arch range (which includes
sm_50/sm_52), so nvcc fails with:
fattn.cu:812: error: no instance of overloaded function "atomicAdd"
matches the argument list, argument types are: (double *, double)
Add the canonical CAS-loop shim from the CUDA C Programming Guide
(B.15 Atomic Functions) guarded on __CUDA_ARCH__ < 600. On sm_60+ the
guard is false and nvcc picks up the native intrinsic as before.
Patch file lives under backend/cpp/buun-llama-cpp/patches/ and is
applied to the cloned fork tree by apply-patches.sh (the infrastructure
already put in place for exactly this class of backport).
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Previous substitution kept the call as 5 args, but buun predates the
upstream refactor that also *added* the logit_bias_eog parameter to
params_from_json_cmpl — buun's signature is still the 4-arg form
(const llama_vocab*, const common_params&, int, const json&)
and it still derives logit_bias_eog internally from the common_params.
Replace the substitution with a line-delete. Guard matches both the
original call (ctx_server.get_meta().logit_bias_eog) and the previously
substituted form (params_base.sampling.logit_bias_eog) so the script
stays safe across re-runs and whatever state the tree was left in.
Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
LocalAI's shared grpc-server.cpp reaches
ctx_server.get_meta().logit_bias_eog twice (the twin params_from_json_cmpl
callsites). That accessor was added to server_context_meta upstream after
buun's 2026-04-05 fork-point, so compiling against buun errors with
'struct server_context_meta' has no member named 'logit_bias_eog'.
Rewrite the call sites — only in the buun grpc-server.cpp copy — to source
the vector from params_base.sampling.logit_bias_eog instead. That vector is
the underlying data the upstream meta accessor eventually returns (buun
still carries common_params_sampling::logit_bias_eog at common.h:280), so
the substitution yields identical behavior on both trees.
The sed is guarded by a grep for the call site, so this patch is
self-disabling once buun rebases past the upstream refactor.
Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash] [WebFetch]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
spiritbuun/buun-llama-cpp is a fork of TheTom/llama-cpp-turboquant that adds
two independent features on top: DFlash block-diffusion speculative decoding
(via a dedicated DFlashDraftModel GGUF arch) and two extra TCQ KV-cache
variants (turbo2_tcq, turbo3_tcq) on top of TurboQuant's turbo2/turbo3/turbo4.
Follows the turboquant thin-wrapper pattern — reuses backend/cpp/llama-cpp
grpc-server sources verbatim, patches only the build copy to extend the KV
allow-list and wire up buun-exclusive tree_budget / draft_topk options.
DraftModel is already wired end-to-end (proto field 39 → params.speculative),
so DFlash activation only needs the existing options passthrough
(spec_type:dflash) plus the drafter path in draft_model.
CacheTypeOptions now surfaces the five turbo* values so the React UI dropdown
shows them — benefits turboquant too (previously users had to type them in
YAML manually).
Assisted-by: Claude:Opus-4.7 [Read] [Edit] [Bash] [WebFetch]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>