From 9787bee48b79da90fa6b70480cefe00a055a0312 Mon Sep 17 00:00:00 2001 From: Ettore Di Giacinto Date: Fri, 24 Apr 2026 20:09:36 +0000 Subject: [PATCH] fix(buun-llama-cpp): shim cudaMemcpy{To,From}Symbol + WARP_SIZE on fwht128 shuffles MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two more hipblas-only build failures in buun's fattn.cu, fixed under the same patches/ infrastructure: 1. cudaMemcpyToSymbol / cudaMemcpyFromSymbol — buun's Q² calibration + TCQ codebook upload paths call the symbol variants of cudaMemcpy. ggml/src/ggml-cuda/vendors/hip.h aliases every other cudaMemcpy* name (cudaMemcpy, cudaMemcpyAsync, cudaMemcpy2DAsync, …) but the symbol pair was never added. 15+ "use of undeclared identifier" errors across fattn.cu lines 40, 54, 74-76, 94, 100-101, 371, 883, 905, 954, 976, 1449, 1463. Add the two missing aliases alongside the existing memcpy block. 2. __shfl_xor_sync fwht128 calls — same 3-arg omission pattern as the earlier argmax top-K fix. Lines 512 (ggml_cuda_fwht128 intra-warp butterfly) and 536 (fwht128_store_half neighbor fetch) drop the width argument that hip.h:33 requires. Add WARP_SIZE. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto --- .../0003-hip-add-memcpy-symbol-aliases.patch | 24 +++++++++++++ ...attn-fwht128-shfl-xor-sync-add-width.patch | 36 +++++++++++++++++++ 2 files changed, 60 insertions(+) create mode 100644 backend/cpp/buun-llama-cpp/patches/0003-hip-add-memcpy-symbol-aliases.patch create mode 100644 backend/cpp/buun-llama-cpp/patches/0004-fattn-fwht128-shfl-xor-sync-add-width.patch diff --git a/backend/cpp/buun-llama-cpp/patches/0003-hip-add-memcpy-symbol-aliases.patch b/backend/cpp/buun-llama-cpp/patches/0003-hip-add-memcpy-symbol-aliases.patch new file mode 100644 index 000000000..2756de491 --- /dev/null +++ b/backend/cpp/buun-llama-cpp/patches/0003-hip-add-memcpy-symbol-aliases.patch @@ -0,0 +1,24 @@ +Subject: [PATCH] ggml-cuda/vendors/hip: alias cudaMemcpy{To,From}Symbol to hip counterparts + +Buun's Q² calibration + TCQ codebook upload paths in fattn.cu use +cudaMemcpyToSymbol / cudaMemcpyFromSymbol. The HIP-compat header in +ggml/src/ggml-cuda/vendors/hip.h already aliases the scalar cudaMemcpy +family (cudaMemcpy, cudaMemcpyAsync, cudaMemcpy2DAsync, …) but is +missing the symbol variants. Building with hipcc therefore fails with +15+ "use of undeclared identifier 'cudaMemcpyToSymbol'" errors. + +Add the two missing aliases alongside the existing memcpy block. HIP +provides hipMemcpy{To,From}Symbol with the same signature as CUDA's +equivalents, so this is a straight name substitution. + +--- a/ggml/src/ggml-cuda/vendors/hip.h ++++ b/ggml/src/ggml-cuda/vendors/hip.h +@@ -85,6 +85,8 @@ + #define cudaMemcpyDeviceToDevice hipMemcpyDeviceToDevice + #define cudaMemcpyDeviceToHost hipMemcpyDeviceToHost + #define cudaMemcpyHostToDevice hipMemcpyHostToDevice ++#define cudaMemcpyToSymbol hipMemcpyToSymbol ++#define cudaMemcpyFromSymbol hipMemcpyFromSymbol + #define cudaMemcpyKind hipMemcpyKind + #define cudaMemset hipMemset + #define cudaMemsetAsync hipMemsetAsync diff --git a/backend/cpp/buun-llama-cpp/patches/0004-fattn-fwht128-shfl-xor-sync-add-width.patch b/backend/cpp/buun-llama-cpp/patches/0004-fattn-fwht128-shfl-xor-sync-add-width.patch new file mode 100644 index 000000000..0b19a03bf --- /dev/null +++ b/backend/cpp/buun-llama-cpp/patches/0004-fattn-fwht128-shfl-xor-sync-add-width.patch @@ -0,0 +1,36 @@ +Subject: [PATCH] ggml-cuda/fattn: pass WARP_SIZE to fwht128 __shfl_xor_sync calls + +Same issue as the argmax top-K fix: two __shfl_xor_sync call sites in +the FWHT-128 butterfly kernels (ggml_cuda_fwht128 and fwht128_store_half) +use the 3-arg CUDA form and omit the `width` argument that the HIP +function-like macro in vendors/hip.h:33 requires. Hipcc fails with: + + fattn.cu:512: too few arguments provided to function-like macro + invocation + note: macro '__shfl_xor_sync' defined here: + #define __shfl_xor_sync(mask, var, laneMask, width) \ + __shfl_xor(var, laneMask, width) + +Add WARP_SIZE to both calls. CUDA codegen is unchanged (warpSize is the +default); HIP now matches the macro arity. + +--- a/ggml/src/ggml-cuda/fattn.cu ++++ b/ggml/src/ggml-cuda/fattn.cu +@@ -509,7 +509,7 @@ + // Intra-warp passes: shuffle xor with stride h, no smem, no sync. + #pragma unroll + for (int h = 1; h <= 16; h *= 2) { +- const float other = __shfl_xor_sync(0xFFFFFFFF, val, h); ++ const float other = __shfl_xor_sync(0xFFFFFFFF, val, h, WARP_SIZE); + val = (tid & h) ? (other - val) : (val + other); + } + +@@ -533,7 +533,7 @@ + static __device__ __forceinline__ void fwht128_store_half( + float val, half * dst_base) { + const int tid = threadIdx.x; +- const float neighbor = __shfl_xor_sync(0xFFFFFFFF, val, 1); ++ const float neighbor = __shfl_xor_sync(0xFFFFFFFF, val, 1, WARP_SIZE); + if ((tid & 1) == 0) { + const half2 packed = __floats2half2_rn(val, neighbor); + *((half2 *)(dst_base + tid)) = packed;