mirror of
https://github.com/mudler/LocalAI.git
synced 2026-05-20 06:35:41 -04:00
fix(buun-llama-cpp): shim cudaMemcpy{To,From}Symbol + WARP_SIZE on fwht128 shuffles
Two more hipblas-only build failures in buun's fattn.cu, fixed under the same patches/ infrastructure: 1. cudaMemcpyToSymbol / cudaMemcpyFromSymbol — buun's Q² calibration + TCQ codebook upload paths call the symbol variants of cudaMemcpy. ggml/src/ggml-cuda/vendors/hip.h aliases every other cudaMemcpy* name (cudaMemcpy, cudaMemcpyAsync, cudaMemcpy2DAsync, …) but the symbol pair was never added. 15+ "use of undeclared identifier" errors across fattn.cu lines 40, 54, 74-76, 94, 100-101, 371, 883, 905, 954, 976, 1449, 1463. Add the two missing aliases alongside the existing memcpy block. 2. __shfl_xor_sync fwht128 calls — same 3-arg omission pattern as the earlier argmax top-K fix. Lines 512 (ggml_cuda_fwht128 intra-warp butterfly) and 536 (fwht128_store_half neighbor fetch) drop the width argument that hip.h:33 requires. Add WARP_SIZE. Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
@@ -0,0 +1,24 @@
|
||||
Subject: [PATCH] ggml-cuda/vendors/hip: alias cudaMemcpy{To,From}Symbol to hip counterparts
|
||||
|
||||
Buun's Q² calibration + TCQ codebook upload paths in fattn.cu use
|
||||
cudaMemcpyToSymbol / cudaMemcpyFromSymbol. The HIP-compat header in
|
||||
ggml/src/ggml-cuda/vendors/hip.h already aliases the scalar cudaMemcpy
|
||||
family (cudaMemcpy, cudaMemcpyAsync, cudaMemcpy2DAsync, …) but is
|
||||
missing the symbol variants. Building with hipcc therefore fails with
|
||||
15+ "use of undeclared identifier 'cudaMemcpyToSymbol'" errors.
|
||||
|
||||
Add the two missing aliases alongside the existing memcpy block. HIP
|
||||
provides hipMemcpy{To,From}Symbol with the same signature as CUDA's
|
||||
equivalents, so this is a straight name substitution.
|
||||
|
||||
--- a/ggml/src/ggml-cuda/vendors/hip.h
|
||||
+++ b/ggml/src/ggml-cuda/vendors/hip.h
|
||||
@@ -85,6 +85,8 @@
|
||||
#define cudaMemcpyDeviceToDevice hipMemcpyDeviceToDevice
|
||||
#define cudaMemcpyDeviceToHost hipMemcpyDeviceToHost
|
||||
#define cudaMemcpyHostToDevice hipMemcpyHostToDevice
|
||||
+#define cudaMemcpyToSymbol hipMemcpyToSymbol
|
||||
+#define cudaMemcpyFromSymbol hipMemcpyFromSymbol
|
||||
#define cudaMemcpyKind hipMemcpyKind
|
||||
#define cudaMemset hipMemset
|
||||
#define cudaMemsetAsync hipMemsetAsync
|
||||
@@ -0,0 +1,36 @@
|
||||
Subject: [PATCH] ggml-cuda/fattn: pass WARP_SIZE to fwht128 __shfl_xor_sync calls
|
||||
|
||||
Same issue as the argmax top-K fix: two __shfl_xor_sync call sites in
|
||||
the FWHT-128 butterfly kernels (ggml_cuda_fwht128 and fwht128_store_half)
|
||||
use the 3-arg CUDA form and omit the `width` argument that the HIP
|
||||
function-like macro in vendors/hip.h:33 requires. Hipcc fails with:
|
||||
|
||||
fattn.cu:512: too few arguments provided to function-like macro
|
||||
invocation
|
||||
note: macro '__shfl_xor_sync' defined here:
|
||||
#define __shfl_xor_sync(mask, var, laneMask, width) \
|
||||
__shfl_xor(var, laneMask, width)
|
||||
|
||||
Add WARP_SIZE to both calls. CUDA codegen is unchanged (warpSize is the
|
||||
default); HIP now matches the macro arity.
|
||||
|
||||
--- a/ggml/src/ggml-cuda/fattn.cu
|
||||
+++ b/ggml/src/ggml-cuda/fattn.cu
|
||||
@@ -509,7 +509,7 @@
|
||||
// Intra-warp passes: shuffle xor with stride h, no smem, no sync.
|
||||
#pragma unroll
|
||||
for (int h = 1; h <= 16; h *= 2) {
|
||||
- const float other = __shfl_xor_sync(0xFFFFFFFF, val, h);
|
||||
+ const float other = __shfl_xor_sync(0xFFFFFFFF, val, h, WARP_SIZE);
|
||||
val = (tid & h) ? (other - val) : (val + other);
|
||||
}
|
||||
|
||||
@@ -533,7 +533,7 @@
|
||||
static __device__ __forceinline__ void fwht128_store_half(
|
||||
float val, half * dst_base) {
|
||||
const int tid = threadIdx.x;
|
||||
- const float neighbor = __shfl_xor_sync(0xFFFFFFFF, val, 1);
|
||||
+ const float neighbor = __shfl_xor_sync(0xFFFFFFFF, val, 1, WARP_SIZE);
|
||||
if ((tid & 1) == 0) {
|
||||
const half2 packed = __floats2half2_rn(val, neighbor);
|
||||
*((half2 *)(dst_base + tid)) = packed;
|
||||
Reference in New Issue
Block a user