Two more hipblas-only build failures in buun's fattn.cu, fixed under the
same patches/ infrastructure:
1. cudaMemcpyToSymbol / cudaMemcpyFromSymbol: buun's Q² calibration and
   TCQ codebook upload paths call the symbol variants of cudaMemcpy.
   ggml/src/ggml-cuda/vendors/hip.h aliases every other cudaMemcpy*
   name (cudaMemcpy, cudaMemcpyAsync, cudaMemcpy2DAsync, …) but the
   symbol pair was never added. hipcc reports 15+ "use of undeclared
   identifier" errors across fattn.cu lines 40, 54, 74-76, 94, 100-101,
   371, 883, 905, 954, 976, 1449, 1463. Add the two missing aliases
   alongside the existing memcpy block.
2. __shfl_xor_sync fwht128 calls: same 3-arg omission pattern as the
   earlier argmax top-K fix. Lines 512 (ggml_cuda_fwht128 intra-warp
   butterfly) and 536 (fwht128_store_half neighbor fetch) drop the
   width argument that the macro at hip.h:33 requires. Pass WARP_SIZE
   as the fourth argument.
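The alias pattern from item 1 can be sketched on the host. This is a
minimal illustration, not the patch itself: the mock_hip* functions and
g_codebook are hypothetical stand-ins so the snippet builds without a GPU
toolchain, and their three-argument signature is a simplification. In
vendors/hip.h the right-hand sides would be the HIP runtime's
hipMemcpyToSymbol and hipMemcpyFromSymbol.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical host-only stand-ins for the HIP runtime calls. The real
// hipMemcpyToSymbol / hipMemcpyFromSymbol copy to and from device symbols
// and take additional offset and kind parameters.
static int mock_hipMemcpyToSymbol(void *symbol, const void *src, std::size_t n) {
    std::memcpy(symbol, src, n);
    return 0;  // 0 plays the role of hipSuccess
}

static int mock_hipMemcpyFromSymbol(void *dst, const void *symbol, std::size_t n) {
    std::memcpy(dst, symbol, n);
    return 0;
}

// The alias pattern: CUDA spellings are mapped onto their HIP counterparts
// by plain object-like macros, so call sites keep the CUDA names.
#define cudaMemcpyToSymbol   mock_hipMemcpyToSymbol
#define cudaMemcpyFromSymbol mock_hipMemcpyFromSymbol

// Hypothetical "symbol": a plain array standing in for a __constant__ codebook.
static float g_codebook[4];

// Uploads four values through the CUDA-named call and reads them back.
static bool codebook_round_trip(const float (&src)[4], float (&dst)[4]) {
    if (cudaMemcpyToSymbol(g_codebook, src, sizeof(src)) != 0) return false;
    return cudaMemcpyFromSymbol(dst, g_codebook, sizeof(dst)) == 0;
}
```

Without the two #define lines the call sites in codebook_round_trip would
fail with the same "use of undeclared identifier" error hipcc reports.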
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Two call sites in ggml/src/ggml-cuda/argmax.cu (the top-K intra-warp
merge added by buun) use the 3-arg CUDA form __shfl_xor_sync(mask, var,
laneMask), omitting the optional width parameter. The hipification shim
at ggml/src/ggml-cuda/vendors/hip.h:33 is a function-like macro that
requires all four arguments, so hipcc fails with:
    argmax.cu:265: too few arguments provided to function-like macro invocation
    note: macro '__shfl_xor_sync' defined here:
    #define __shfl_xor_sync(mask, var, laneMask, width) \
        __shfl_xor(var, laneMask, width)
Every other call in the same file already passes WARP_SIZE explicitly;
aligning these two with that convention fixes the hipblas build without
changing CUDA codegen (on CUDA the omitted width defaults to warpSize).
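The arity mismatch can be reproduced without a GPU. The sketch below is a
host-only illustration under stated assumptions: toy_shfl_xor,
TOY_SHFL_XOR_SYNC and partner_value are hypothetical names that mirror the
shape of the hip.h shim, and the toy model assumes each lane holds its own
lane index, so the partner's value is simply var ^ laneMask.

```cpp
constexpr int WARP_SIZE = 32;  // assumption: 32-lane warps

// Hypothetical stand-in for the 3-argument intrinsic. In this toy model
// lane i holds the value i, so the value fetched from the partner lane is
// var ^ lane_mask. The width is accepted but not modelled.
static inline int toy_shfl_xor(int var, int lane_mask, int width) {
    (void) width;
    return var ^ lane_mask;
}

// Same shape as the shim: a function-like macro with exactly four
// parameters. The mask is dropped; the width is forwarded.
#define TOY_SHFL_XOR_SYNC(mask, var, laneMask, width) \
    toy_shfl_xor(var, laneMask, width)

// Value held by the butterfly partner of `lane` at the given stride.
static inline int partner_value(int lane, int stride) {
    // Four arguments: the macro expands cleanly.
    return TOY_SHFL_XOR_SYNC(0xffffffffu, lane, stride, WARP_SIZE);
    // Three arguments would be rejected by the preprocessor:
    //   TOY_SHFL_XOR_SYNC(0xffffffffu, lane, stride)
}
```

A function with a default argument would accept the 3-arg form; a
function-like macro cannot, which is why the call sites are patched rather
than left as they are.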
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Buun's Q² calibration path in ggml/src/ggml-cuda/fattn.cu calls
atomicAdd with a double* destination. Native double atomicAdd is only
available on CUDA compute capability 6.0 and later. LocalAI's CUDA 12
Docker image builds for the full published arch range (which includes
sm_50/sm_52), so nvcc fails with:
    fattn.cu:812: error: no instance of overloaded function "atomicAdd"
    matches the argument list, argument types are: (double *, double)
Add the canonical CAS-loop shim from the CUDA C Programming Guide
(B.15 Atomic Functions) guarded on __CUDA_ARCH__ < 600. On sm_60+ the
guard is false and nvcc picks up the native intrinsic as before.
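The shape of that CAS loop can be sketched on the host. This is an
analogue, not the shim from the guide: atomic_add_double_cas, to_bits and
from_bits are hypothetical names, std::atomic<std::uint64_t> stands in for
atomicCAS on unsigned long long, and std::memcpy stands in for the
__double_as_longlong / __longlong_as_double bit casts.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "the bit-cast trick assumes a 64-bit double");

// Reinterpret the bits of a double as a 64-bit integer and back.
static inline std::uint64_t to_bits(double d) {
    std::uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    return u;
}

static inline double from_bits(std::uint64_t u) {
    double d;
    std::memcpy(&d, &u, sizeof d);
    return d;
}

// Hypothetical host analogue of the CAS-loop double add. Returns the value
// stored before the addition, as atomicAdd does.
static inline double atomic_add_double_cas(std::atomic<std::uint64_t> &cell,
                                           double val) {
    std::uint64_t assumed = cell.load();
    // On failure compare_exchange_weak reloads `assumed` with the current
    // bits, so each retry recomputes the sum from fresh data. Comparing
    // integer bits rather than doubles avoids spinning forever on NaN.
    while (!cell.compare_exchange_weak(
        assumed, to_bits(from_bits(assumed) + val))) {
    }
    return from_bits(assumed);
}
```

The loop only retries when another writer changed the cell between the
load and the exchange, so the uncontended cost is one compare-and-swap.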
Patch file lives under backend/cpp/buun-llama-cpp/patches/ and is
applied to the cloned fork tree by apply-patches.sh (the infrastructure
already put in place for exactly this class of backport).
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>