mirror of
https://github.com/mudler/LocalAI.git
synced 2026-05-24 08:38:02 -04:00
fix(buun-llama-cpp): shim atomicAdd(double*,double) for pre-sm_60 CUDA
Buun's Q² calibration path in ggml/src/ggml-cuda/fattn.cu calls
atomicAdd with a double* destination. Native double atomicAdd is only
available on CUDA compute capability 6.0 and later — LocalAI's CUDA 12
Docker image builds for the full published arch range (which includes
sm_50/sm_52), so nvcc fails with:
fattn.cu:812: error: no instance of overloaded function "atomicAdd"
matches the argument list, argument types are: (double *, double)
Add the canonical CAS-loop shim from the CUDA C Programming Guide
(B.15 Atomic Functions) guarded on __CUDA_ARCH__ < 600. On sm_60+ the
guard is false and nvcc picks up the native intrinsic as before.
Patch file lives under backend/cpp/buun-llama-cpp/patches/ and is
applied to the cloned fork tree by apply-patches.sh (the infrastructure
already put in place for exactly this class of backport).
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
@@ -0,0 +1,46 @@
|
||||
Subject: [PATCH] ggml-cuda/fattn: provide atomicAdd(double*,double) shim for pre-sm_60
|
||||
|
||||
Buun's Q² calibration path in ggml_cuda_turbo_scale_q calls
|
||||
atomicAdd(&d_q_channel_sq_fattn[threadIdx.x], (double)(val * val));
|
||||
but native double atomicAdd is only available on compute capability 6.0
|
||||
and newer. Compiling against a CUDA arch list that includes older
|
||||
architectures (LocalAI's CUDA 12 Docker image builds for the full
|
||||
published arch range) fails with:
|
||||
|
||||
fattn.cu(812): error: no instance of overloaded function "atomicAdd"
|
||||
matches the argument list, argument types are: (double *, double)
|
||||
|
||||
Add the canonical CUDA-programming-guide shim at the top of fattn.cu so
|
||||
pre-sm_60 codegen has a definition to call. On sm_60+ the native CUDA
|
||||
intrinsic is used and the shim is elided via __CUDA_ARCH__.
|
||||
|
||||
--- a/ggml/src/ggml-cuda/fattn.cu
|
||||
+++ b/ggml/src/ggml-cuda/fattn.cu
|
||||
@@ -7,6 +7,27 @@
|
||||
|
||||
#include <atomic>
|
||||
|
||||
+// Pre-sm_60 double atomicAdd shim. Native double atomicAdd(double*,double)
|
||||
+// is only available on CUDA compute capability 6.0+ (see CUDA C Programming
|
||||
+// Guide, B.15 Atomic Functions). Buun's Q² calibration path below calls
|
||||
+// atomicAdd with a double*; without this definition, nvcc fails to find a
|
||||
+// matching overload whenever the compile target list includes pre-sm_60
|
||||
+// architectures. The standard CAS loop implementation below matches the
|
||||
+// semantics of the native intrinsic.
|
||||
+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 600
|
||||
+static __device__ double atomicAdd(double * address, double val) {
|
||||
+ unsigned long long int * address_as_ull = (unsigned long long int *)address;
|
||||
+ unsigned long long int old = *address_as_ull;
|
||||
+ unsigned long long int assumed;
|
||||
+ do {
|
||||
+ assumed = old;
|
||||
+ old = atomicCAS(address_as_ull, assumed,
|
||||
+ __double_as_longlong(val + __longlong_as_double(assumed)));
|
||||
+ } while (assumed != old);
|
||||
+ return __longlong_as_double(old);
|
||||
+}
|
||||
+#endif
|
||||
+
|
||||
// InnerQ: update the fattn-side inverse scale array from host (all devices)
|
||||
void turbo_innerq_update_fattn_scales(const float * scale_inv) {
|
||||
int cur_device;
|
||||
Reference in New Issue
Block a user